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Abstract 
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Biometrics  technologies  have  grown  considerably  in  recent  years  with  better  computing 
and  an  expanding  realm  in  which  these  tools  are  deployed.  Among  these,  iris  recognition 
demonstrates  superior  performance  as  a  biometric,  perhaps  far  exceeding  the  standard  fingerprint 
recognition  of  past  decades.  Unfortunately,  iris  recognition  is  very  computationally  intensive, 
requiring  near  state-of-the-art  traditional  processing  methods.  Because  of  the  complexity  of  iris 
recognition  systems,  many  portable  iris  scanners  are  bulky,  cumbersome  and  very  expensive, 
often  requiring  laptop  computers  to  carry  out  the  computations.  This  is  due  to  a  reliance  on 
sequential  processing,  the  manner  of  computing  we  see  in  a  typical  personal  computer.  However, 
there  is  an  alternative  with  parallel  processing  using  multicore  processors,  field-programmable 
gate  arrays  (FPGAs)  or  application  specific  integrated  circuits  (ASICs).  These  devices  can  speed 
algorithms  through  parallel  processing.  Taking  the  algorithm  developed  by  Dr.  Robert  Ives  et  al. 
of  the  United  States  Naval  Academy  for  iris  recognition,  parallelizable  parts  of  the  algorithm  can 
be  translated  for  parallel  processing.  A  parallel  version  of  the  algorithm  may  be  substantially 
faster  and  physically  much  smaller  implemented.  This  implementation  is  placed  into  an  FPGA 
system  in  order  to  evaluate  the  performance  of  specific  parts  of  the  algorithm  converted  from 
sequential  C  code  to  parallel  hardware  logic  with  respect  to  speed  and  hardware  footprint. 
Additionally,  this  project  seeks  to  evaluate  the  feasibility  of  an  entirely  embedded  iris 
recognition  system  comprised  of  both  sequential  C  software  and  parallel  hardware  on  a  single 
chip  and  a  discrete  memory  module.  The  resulting  hardware  is  shown  to  be  between  10  and  1000 
times  faster  than  current  methods  while  being  entirely  embedded  and  independent  of  a  host 
system  for  processing.  These  chips  could  be  deployed  to  offer  a  handheld,  high-speed  and  palm¬ 
sized  iris  recognition  system  with  all  the  necessary  functionality  expected  in  a  commercial  iris 
recognition  system. 


Keywords:  Biometrics,  Iris  Recognition,  Systolic  Architecture,  Finite  Impulse  Response 
Filtering,  Binary  Morphology 
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Chapter  1 
Introduction 


Biometric  identification  has  exploded  in  recent  years  as  the  next  generation  of  security 
and  authentication  systems.  Biometrics  is  the  study  and  measurement  of  the  physical 
characteristics  that  define  an  individual’s  “uniqueness.”1  For  example,  a  common  form  of 
biometrics  that  humans  practice  daily  is  facial  recognition.  In  our  everyday  interactions,  we  are 
able  recognize  friends  and  family  by  their  familiar  faces,  which  we  measure  and  interpret  with 
our  eyes.  Humans  can  also  readily  recognize  familiar  people  by  their  height  and  body  shape, 
voice  or  even  fashion  sense.  Biometrics  sums  up  all  these  traits  and  many  more  for  qualitative 
or,  in  the  case  of  computers,  quantitative  analysis. 

While  humans  have  mastered  the  art  of  biometric  analysis  through  our  natural  means, 
computers  offer  us  exceptionally  powerful  means  of  recognition  that  we  have  only  recently  been 
able  to  explore.  For  example,  humans  lack  the  capacity  to  recognize  the  fine  details  of  a  friend’s 
iris  beyond  eye  color,  but  a  computer  can  document  the  complex,  unique  pattern  of  the  iris  and 
recall  it  later  for  recognition.  This  process  of  recording  biometric  data  into  a  computer  is  called 
enrollment.  Computers  have  also  proven  capable  of  other  biometric  identification  tasks  such  as 
fingerprinting,  DNA  analysis,  facial  recognition,  and  even  palm  venous  structure  recognition.2 
Also,  with  the  exception  of  DNA  analysis,  computers  obtain  most  of  this  data  in  unobtrusive 
ways,  resulting  in  no  harm  to  the  person  being  measured.  In  some  cases,  a  computer  can  enroll 
and  recognize  a  person  on  the  fly,  which  is  the  focus  of  portable  iris  recognition  systems.3 

Iris  recognition  stands  out  as  one  of  the  most  reliable  and  growing  biometric  methods  in 
existence.  The  iris  is  an  internal  part  of  the  eye  that  is  protected  by  the  cornea,  sustaining  its 
appearance  after  decades  even  after  facial  features  and  fingerprints  have  changed.  It  consists  of 
the  sphincter  pupillae  and  the  dilator  pupillae  muscles  to  vary  the  diameter  of  the  pupil. 
Together,  these  muscles  span  a  flat  surface  that  exhibits  a  highly  unique  pattern  with  nearly 
infinite  variability.4  An  example  iris  photograph  is  depicted  below. 
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Figure  1.1:  Typical  human  iris.  Also  visible  are  eyelids,  eyelashes,  and  glare. 


Furthermore,  biometric  technology  advancements  have  made  iris  recognition  systems 
nearly  impossible  to  fool  with  high  resolution  photographs,  printed  contact  lenses  or  even  more 
menacing  means.  The  iris  can  be  photographed  at  high  resolution  at  a  distance  of  up  to  a  few 
meters,  even  through  glasses  or  contact  lenses  and  in  varying  light  conditions  which  can  alter  the 
visible  area  of  the  iris. 

Basic  iris  recognition  systems  photograph  the  iris  using  near  infrared  (NIR)  light  and 
process  it  for  matching.  Infrared  light  can  be  captured  by  a  typical  camera  with  a  special  filter 
lens  and  allows  the  camera  to  capture  the  structural  features  of  the  iris  unimpeded  by 
pigmentation  or  color  lighting  conditions.  Once  a  digital  image  of  the  iris  is  captured,  the  system 
begins  processing  the  image  via  algorithms  to  transform  it  from  a  two  dimensional  array  of 
pixels  (picture  elements)  to  a  one  dimensional  encoded  string  of  bits  for  comparison.  This 
transformation  from  pixels  to  information  is  a  common  digital  signal  processing  technique.  Iris 
recognition  systems  assume  that  the  features  of  the  iris  are  so  unique  that  they  are  like  a 
biological  password.  This  password  is  the  information  contained  in  the  iris’  structural  features 
and  can  be  determined  by  noting  the  frequency  and  phase  information  of  the  iris’  features  such  as 
in  John  Daugman’s  algorithm  or  other  possible  methods  that  will  be  explored  further  in  this 
paper.  Different  iris  recognition  algorithms  exist  but  Dr.  Robert  Ives  et  al.  of  the  United  States 
Naval  Academy  has  developed  an  algorithm,  known  as  the  Ridge  Energy  Direction  (RED) 
algorithm,  which  has  proven  a  reliable  method  for  iris  recognition.5  In  this  algorithm,  the  first 
step  is  to  identify  the  iris  among  other  facial  elements  such  as  the  eyelids,  sclera  (white  part  of 
the  eye),  pupil  (dark  circle  in  the  center  of  the  eye)  and  eyelashes.  The  algorithm  accomplishes 
this  by  scanning  the  image  for  the  center  of  the  pupil  using  several  segmentation  techniques  and 
then  uses  a  statistical  approach  to  identify  the  outer  radius  of  the  iris  (limbic  boundary).  This 
establishes  the  central  point  of  the  iris  within  the  image’s  x  and  y  coordinates,  allowing  the 
computer  to  extract  only  the  meaningful  portions  of  the  iris. 
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Figure  1.2:  Measurable  iris  area  detected  by  the  algorithm.  Also  visible  is  the  associated 
one  dimensional  encoding  of  the  iris  image  in  the  top  left.6 


The  next  step  is  to  encode  the  iris  image  from  two  dimensions  down  to  one,  also  known 
as  a  template,  in  order  to  compare  the  iris  to  a  previously  enrolled  iris  template.  A  template  is  a 
bit  vector  that  uniquely  represents  the  information  contained  in  the  iris.  In  biometric  algorithms, 
it  is  necessary  to  record  the  information  that  the  pixels  of  an  image  convey  rather  than  the  pixels 
themselves.  This  is  because  pixels  will  differ  from  one  image  to  another  but  the  information 
should  be  consistent  if  the  iris  within  the  image  is  in  fact  the  same.  In  the  RED  algorithm,  the  iris 
is  unwrapped  using  polar  coordinates  and  the  energy  of  each  pixel  is  measured  (the  energy  of 
each  pixel  is  merely  the  square  of  the  value  of  the  infrared  intensity  within  the  pixel).  Then,  the 
energy  of  m  x  n  pixels  is  passed  into  four  filters  to  emphasize  the  existence  of  ridges  and  their 
orientation.  The  filter  with  the  highest  output  indicates  this  orientation  and  the  algorithm 
represents  the  orientation  using  a  single  bit  which  is  passed  into  the  template.  This  process  is 
demonstrated  below. 
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IR  Pixel  Map  Corresponding  Energy  Map 


Figure  1.3a:  Example  transformation  from  pixel  map  to  energy  map  (4x4  block  of  pixels) 
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Figure  1.3b:  Demonstration  of  how  the  filtering  is  applied  to  the  energy  map  and  the 
corresponding  bits  that  are  passed  into  the  template  representing  a  piece  of  the  iris.  Note: 
this  is  not  matrix  multiplication  but  instead  element  by  element  multiplication  and  then 

summation  for  comparison  (convolution). 7 
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The  characterization  ignores  variations  in  image  size  by  scaling  the  area  of  interest  with 
respect  to  the  actual  size  of  the  captured  iris.  This  is  intuitive  due  to  the  fact  that  everyone’s  iris 
is  roughly  the  same  size  and  actual  dimensions  are  assumed  and  normalized.  The 
characterization  encodes  only  2,048  bits  or  256  bytes  in  the  RED  algorithm,  like  the  famous 
algorithm  written  by  Dr.  John  Daugman,  which  records  enough  data  to  distinguish  between  all 
the  irises  in  the  world  more  than  billions  of  times  over.*  Therefore,  the  algorithm  requires  1,024 
iterations  of  this  process  to  encode  a  full  iris  image  into  a  template  which  can  be  a  slow  and 
demanding  computation  using  a  single  sequential  process. 

Once  encoded,  the  iris  recognition  system  must  be  able  to  reliably  match  the  result  with  a 
previously  enrolled  person.  The  newly  encoded  iris  and  its  previously  enrolled  mate  (or  not)  are 
matched  using  Hamming  Distance  (HD)  and  their  associated  masks  (used  to  omit  bits  of  data 
corrupted  by  eyelashes  or  noise).  Hamming  distance  is  calculated  within  the  algorithm  using  the 
following  Boolean  logic  formula: 


HD= 


( codeA  ©  codeB )  n  maskA  n  maskB 
II maskA  n  maskB\\ 


(1.1) 


In  this  formula,  Hamming  distance  is  measured  by  first  seeing  how  many  bits  between  A 
and  B  agree  and  then  modulating  out  those  bits  that  may  be  corrupted.  The  mask  bits  for  a 
template  (codeA  or  codeB  here)  are  determined  simultaneously  with  the  template.  A  mask  bit 
signifies  corrupted  data  because  at  the  time  of  filtering,  the  template  generation  process  could  not 
determine  a  definite  horizontal  or  vertical  ridge.  Sidestepping  an  in  depth  statistical  analysis  of 
Hamming  distance,  the  matching  algorithm  resembles  a  binomial  distribution  of  249  degrees  of 
freedom  (N=249)  with  the  chance  of  a  1  or  0  in  the  encoded  iris  being  p  =  .500.  From  this 
understanding,  the  system  can  be  fine  tuned  to  accept  varying  HD  values  that  correspond  to 
different  degrees  of  false  acceptance  rate  (FAR)  and  false  rejection  rate  (FRR)  using  the 
equivalent  negative  binomial  distribution.9  Lower  Hamming  Distance  correlates  to  exponentially 
lower  odds  of  a  false  match. 

Each  step  of  the  algorithm  requires  varying  times  of  execution.  Depending  on  the  number 
of  templates  that  need  to  be  matched,  template  matching  can  require  the  most  time.  Matching  the 
information  of  a  few  people  such  as  those  in  a  household  or  a  small  business  would  take  a 
negligible  amount  of  time  while  a  database  of  an  entire  nation’s  people  would  require  significant 
time  to  process.  The  other  major  part  of  the  algorithm  to  consume  processing  resources  is  the 
segmentation  process.  To  accurately  detect  the  location  of  an  iris  within  an  image,  the  algorithm 
uses  extensive  image  processing  techniques.  Image  processing  without  the  assistance  of  a  co¬ 
processor  such  as  a  video  card  in  a  general  purpose  system  can  be  noticeably  slow.  This 
processing  time  can  even  be  more  pronounced  when  migrating  the  iris  recognition  system  from  a 
general  purpose  system  with  a  high  efficiency  and  high  speed  processor  to  a  slower  “mobile” 
solution. 
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Chapter  2 
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2. 1  Field-Programmable  Gate  Arrays  and  Programmable  Logic 
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Figure  2.1:  Adaptive  Logic  Modules  (ALMs)  of  the  Stratix  III  FPGA  are  general  purpose 
logic  elements  and  are  physically  inside  the  Stratix  III.  When  the  FPGA  is  configured,  these 
logic  elements  are  interconnected  by  the  1000’s  to  implement  any  higher  level  hardware 

function.11 


Designing  and  prototyping  integrated  circuits  were  once  a  challenging  and  extremely 
expensive  process.  In  the  past,  all  integrated  circuits  designed  for  a  specific  application 
(application  specific  integrated  circuits  or  ASICs)  had  to  be  carefully  designed  to  avoid  flaws 
and  once  fabricated,  could  not  be  changed.  Recently,  with  the  development  of  flash  densities  and 
fabrication  processes,  vendors  such  as  Xilinx  and  Altera  have  begun  producing  integrated 
circuits  that  have  general  purpose  logic  such  as  the  logic  seen  in  Figure  2.1  with  programmable 
interconnectivity  inside  the  chip  shown  in  Figure  2.2  of  the  Stratix  III  architecture.  These  chips 
are  called  programmable  logic  devices  (PLDs)  because  the  connections  between  the  internal 
logic  elements  can  be  programmed  much  like  loading  data  into  flash  memory.  The  logic 
elements  of  these  PLDs  are  designed  to  serve  any  function  from  basic  ‘AND’  and  ‘OR’  gates  to 
adders  and  even  registers  and  memory.  Thus,  an  engineer  has  the  opportunity  to  configure  the 
interaction  and  connectivity  of  the  logic  elements  long  after  the  PLD  has  been  fabricated,  a  huge 
step  forward  from  traditional  hardware  design  with  ASICs. 
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Figure  2.2:  Interconnection  within  the  Stratix  III  FPGA 


Larger  PLDs  with  more  advanced  internal  logic  elements  and  design  are  called  complex 
programmable  logic  devices  (CPLDs)  and  even  more  advanced  programmable  logic  with  greater 
applicability  are  called  field  programmable  gate-arrays  (FPGAs).  While  PLDs  were  simple  and 
had  limited  applicability  in  production  level  designs,  FPGAs  are  getting  so  advanced  that  they 
are  able  to  compete  with  high  level  systems,  an  area  once  exclusively  the  realm  of 
microprocessors,  and  are  even  being  incorporated  into  supercomputers  as  co-processors. 
Someday,  FPGAs  or  their  future  iterations  may  become  so  advanced  that  they  fully  supplant  the 
need  for  ASIC  design  leaving  only  high  end  microprocessors  and  FPGAs. 

This  rise  in  FPGA  capability  has  given  the  electrical  engineering  design  community  an 
entirely  new  venue  to  build  and  test  complex  machines.  Thus,  not  only  are  FPGAs  themselves  a 
focus  of  research  but  also  provide  a  foundation  for  research  in  hardware  design.  State-of-the-art 
FPGAs  allow  rapid  prototyping  of  entire  systems  in  minutes  through  intuitive  design  software 
such  as  Altera’s  Quartus  II  software  and  hardware  description  languages.  Hardware  description 
languages  such  as  Verilog  Hardware  Description  Language  (Verilog  HDL)  and  Very-high  speed 
integrated  circuit  Hardware  Description  Language  (VHDL)  can  describe  combination  properties 
of  hardware  (logic  gates).  Hardware  description  languages  allow  engineers  to  describe  the 
transformation  of  data  from  register  to  register  inside  the  FPGA,  inferring  logic  such  as  adders 
and  multiplexers  along  the  way. 

This  research  uses  an  FPGA  to  implement  and  prototype  an  entire  system,  designing 
hardware  specifically  targeted  at  iris  recognition  with  the  RED  algorithm.  Altera  design  software 
allows  the  use  of  Intellectual  Property  (IP)  Core  technology  consisting  of  encrypted  hardware 
description  files  that  instantiate  into  the  FPGA  many  common  and  advanced  components 
specially  designed  for  Altera  devices.  For  example,  rather  than  designing  an  entire  32  bit 
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microprocessor  and  its  C  compiler  from  the  ground  up,  we  can  configure  the  FPGA  with  Altera’ s 
already  tried  and  true  Nios  II  processor.  Subsequently,  we  can  also  connect  peripheral 
components  either  user-defined  or  pre-defined  in  the  Altera  Quartus  II  software  as  memory 
mapped  input  and  output  (I/O)  for  the  Nios.  The  Nios  II  processor  and  the  peripheral  components 
together  are  called  an  embedded  system  and  this  will  be  investigated  in  more  detail  later.  An 
example  Nios  II  system  is  shown  in  Figure  2.3. 
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Figure  2.3:  Example  Nios  based  embedded  system 


In  the  above  example,  the  Nios  II  is  instantiated  with  three  peripheral  components  that 
serve  various  functions.  The  SRAM  Controller  is  a  common  component  that  would  give  the  Nios 
II  processor  access  to  program  and  data  memory.  Typically,  a  Nios  based  system  requires  at  least 
one  memory  peripheral  for  program  and/or  data  memory  since  these  are  essential  to  any 
microprocessor  that  executes  instructions.  The  timer  can  be  set  up  to  send  an  interrupt  to  the 
Nios  II  based  on  a  predefined  interval  relative  to  the  clock  speed  of  the  timer  component.  The 
timer  can  be  configured  to  interrupt  at  1  microsecond  intervals  by  counting  up  to  50  with  a  50 
MHz  clock  at  the  timer  clock  input.  A  timer  is  useful  for  calculating  execution  times  (particularly 
for  this  research)  by  opening  up  the  “ctime”  library  to  the  Nios  processor.  Lastly,  user  defined 
logic  can  be  incorporated  into  a  Nios  based  system  to  perform  application  specific  computations. 
An  engineer  could  design  any  type  of  component  from  his/her  own  SRAM  controller  or  timer  to 
much  more  advanced  components  such  as  a  direct  memory  access  (DMA)  controller  or  a 
VGA/HDMI  interface  for  the  Nios.  All  of  this  is  possible  because  of  the  flexibility  of  Altera’s 
Avalon  Bus  which  brings  all  these  individual  components  together  into  a  single  system.  The 
Avalon  Bus  bridges  the  Nios  and  its  peripheral  components  by  assigning  base  addresses  and 
address  ranges  to  each  component.  The  Nios  can  then  address  the  components  and  talk  to  them 
by  reading  and  writing  directly  into  the  component  from  the  Avalon  Bus. 

2.2  Software  vs.  Hardware  Implementation 


Functions  can  be  implemented  in  either  software  or  hardware.  Typically,  software  is  at  a 
higher  level  than  any  hardware  because,  to  put  it  simply,  software  is  a  roadmap  to  how  existing 
hardware  in  a  microprocessor  works.  Software  is  made  up  of  instructions  that  engage  different 
parts  of  a  microprocessor’s  hardware  in  a  step  by  step  process  to  accomplish  more  complex 


14 

functions  than  with  the  individual  hardware  components  alone.  For  example,  a  hardware  adder 
can  only  ‘add’  but  with  a  set  of  instructions  an  adder  could  be  used  repeatedly  to  multiply  or 
divide.  Flowever,  depending  on  the  algorithm  or  software  in  question,  software  execution  can 
take  short  amounts  of  time  or  continue  ad  infinitum. 

Software  can  implement  any  hardware  function  but  hardware  can  only  execute  hard- 
bounded  software  functions.  For  example,  software  multiplication  such  as  squaring  could  make 
use  of  an  ‘add’  and  a  ‘count’  variable  while  hardware  could  use  a  look-up  table  to  find  the 
answer  as  shown  in  Figure  2.4.  Software  multiplication  is  more  portable  since  it  can  make  use  of 
the  same  basic  binary  logical  add  as  long  as  the  registers  that  hold  the  variables  are  large  enough 
to  prevent  overflow.  Hardware  on  the  other  hand  requires  the  input  to  be  hard  bounded  to  the 
size  of  the  input  of  the  look  up  table  and  would  produce  a  hard  bounded  output  with  twice  as 
many  bits  as  the  input.  Therefore,  software  can  operate  on  variable  inputs  and  outputs  such  as 
packets  while  hardware  implementation  would  be  impractical.  Yet,  hardware  is  capable  of  tasks 
software  can  have  difficulty  executing.  In  the  multiplication  example,  the  software  requires  a 
loop  to  repeatedly  make  use  of  an  add  function  while  the  hardware  equivalent  could  have  the 
answer  in  a  single  step.  In  systems  where  the  hardware  design  can  be  changed  to  fit  the 
application  (such  as  an  ASIC  or  FPGA),  having  a  hardware  squaring  function  compute  the 
square  could  be  advantageous  for  speed  and  power  if  the  software  often  squares  large  numbers. 

Software  Hardware 
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Figure  2.4:  Software  vs.  hardware  implementation  of  a  basic  squaring  function. 

One  could  argue  that  a  look-up  table  is  not  actually  hardware.  In  fact,  it  is  a  little  of  both 
worlds  since  the  hardware  for  a  look-up  table  already  exists  for  any  system  that  can  execute 
software.  Compilers  can  instantiate  a  look-up  table  in  memory  that  the  software  can  then 
reference  at  run-time.  However,  look-up  tables  are  extensively  used  in  hardware 
implementations  for  various  reasons  (some  even  to  act  as  a  kind  of  mini-software  for  a  hardware 
device).  In  both  hardware  and  software,  a  look-up  table  consists  of  a  decoder  that  converts  the 
“question”  to  an  address  that  points  to  the  “answer”.  In  the  previous  example  with  the  look-up 
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table,  this  is  a  very  basic  alternative  to  using  software  to  repeatedly  use  an  adder  for 
multiplicative  computations. 


2.3  Systolic  Architecture 

Before  designing  any  hardware  specific  to  a  function  already  implemented  in  software 
(the  kind  of  implementation  addressed  in  this  research),  we  must  investigate  a  concept  in 
hardware  architecture  that  is  very  useful  when  approaching  a  hardware  design  problem  known  as 
systolic  architecture.  Both  software  and  hardware  offer  many  degrees  of  freedom  in 
implementation  and  there  is  rarely  if  ever  only  one  way  to  implement  something  in  either  case. 
However,  since  an  implementation  in  software  is  fairly  easy  to  change  and  an  implementation  in 
hardware  is  nearly  impossible  to  change  (by  nature  of  how  hardware  design  is  like  painting  a 
picture),  we  must  narrow  the  hardware  implementation  design  flow  to  a  specific  approach.  A  key 
difference  between  software  and  hardware  is  the  opportunity  for  hardware  to  perform  many 
operations  in  parallel. 


Figure  2.5:  Systolic  architecture  examples.  The  top  left  is  a  generic  interaction  between  a 
microprocessor  (pP)  and  a  processing  element  (PE).  The  top  right  is  a  pipeline  style  systolic 
architecture.  The  bottom  is  a  parallel  input  and  pipelined  style  systolic  architecture. 


Designing  hardware  to  execute  something  in  parallel  is  a  vague  concept.  A  formal 
approach  to  define  what  the  parallel  hardware  seeks  to  accomplish  can  be  very  helpful.  For  this, 
a  systolic  architecture  is  essentially  defined  as  an  architecture  that  accomplishes  a  task  with  the 
maximum  throughput  given  finite  optimization  constraints.12  While  this  sounds  even  more  vague 
than  parallel  hardware,  it  means  that  when  designing  parallel  hardware,  we  want  to  perform  the 
most  computations  we  can  around  the  greatest  bottleneck  in  the  system.  Optimization 
constraints  could  be  the  interconnectivity  of  elements  or  the  structure  being  pipelined  or  parallel. 
In  most  systems,  this  is  usually  input  and  output  (I/O)  between  the  microprocessor  and  memory 
or  the  microprocessor  and  its  peripherals.  The  processor  can  only  transfer  for  example,  8,  16,  32 
or  64  bits  at  a  time  depending  on  its  architecture.  That  means  that  if  the  microprocessor  were  to 
transmit  8  bits  to  a  peripheral  component,  that  component  would  want  to  perform  the  most 
computations  possible  given  that  8-bit  input.  There  are  many  different  forms  of  systolic 
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architecture  and  a  few  possible  architectures  are  shown  in  Figure  2.5.  Thus,  the  systolic 
architecture  that  this  research  will  use  is  one  that  is  designed  around  performing  maximum 
parallel  computations  at  the  input  of  the  peripheral  device. 


2.4  Embedded  System 

Building  an  iris  recognition  system  using  exclusively  hardware  would  be  exceptionally 
difficult.  Naturally,  since  general  purpose  systems  lack  the  ability  to  have  their  hardware 
reconfigured  easily,  the  hardware  designs  cannot  be  placed  in  a  general  purpose  system.  Thus, 
this  research  will  first  set  up  a  test  system  that  migrates  the  algorithm  code  from  a  general 
purpose  system  such  a  PC  running  Windows  to  a  system  where  application  specific  hardware  can 
be  implemented  along  with  the  software  of  the  iris  recognition  algorithm.  As  mentioned  before, 
the  Nios  II  microprocessor  can  be  implemented  along  with  many  peripheral  components  to  form 
an  embedded  system  on  an  FPGA.  These  peripherals  can  be  hardware  versions  of  functions  in 
the  algorithm  C  code.  This  paper  will  refer  to  C  code  as  the  entire  category  of  C  programming 
languages  with  C++  code  or  simply  C++  being  more  specific.  In  other  papers  and 
documentation,  C  code  may  refer  to  a  totally  separate  programming  language  from  C++  but  that 
is  not  the  case  here.  First,  we  look  at  the  overall  system  design  using  Altera’s  SOPC  Builder 
shown  in  Figure  2.6  which  is  a  design  suite  intended  for  putting  together  embedded  systems 
based  around  Altera’s  Nios  II  processor. 
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Figure  2.6a:  Embedded  system  concept 
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Figure  2.6b:  Embedded  system  in  the  SOPC  Builder 


This  system  is  implemented  onto  a  board  that  allows  the  FPGA  to  be  configured  with 
hardware  and  programmed  with  software.  FPGAs  rarely  come  standalone  since  their  integration 
with  other  systems  is  hardly  “plug  and  play.”  Thus,  many  FPGAs  for  research  are  purchased  as  a 
development  board.  Development  boards  allow  engineers  to  design  and  prototype  without  having 
to  put  together  a  circuit  board  (PCB)  and  all  the  power  supply  and  analog  circuits  to  support  the 
digital  system.  For  this  research,  the  TerasIC  DE3  Development  and  Education  Board  was  used 
for  board  level  testing  of  the  iris  recognition  algorithm  and  hardware  designs.  The  DE3  board  is 
shown  in  Figure  2.7. 
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Figure  2.7:  TerasIC  DE3  Development  and  Education  Board 


The  FPGA  is  configured  with  hardware  and  memory  loaded  with  software  through  the 
dedicated  JTAG  connection  which  goes  directly  into  the  FPGA.  This  connection  also  serves  as  a 
debugger  and  allows  the  user  to  read  console  output  from  programs  running  on  the  Nios  through 
a  terminal  on  the  JTAG  host  machine.  Since  the  algorithm’s  C  code  is  approximately  680kB,  the 
program  is  much  too  big  for  on-chip  memory  (limited  at  approximately  260kB)  and  must  be 
loaded  into  the  only  other  memory  on  the  DE3  board,  the  DDR2  SDRAM  SO-DIMM  (the  green 
module  at  the  bottom  left  of  Figure  2.7).  Lastly,  the  JTAG  connection  is  insufficient  for  loading 
images  into  the  DE3  board  so  a  separate  USB  interface  was  developed  and  driver  software  for 
Windows  written  to  parse  images  from  a  host  machine  and  download  them  into  the  DE3  board 
for  processing. 

Having  set  up  the  environment  for  which  a  hardware  design  can  be  implemented  and 
tested,  we  are  now  ready  to  proceed  with  developing  hardware  alternative  functions  to  the 
algorithm’s  software  functions.  Testing  is  done  by  evaluating  the  software  timing  by  placing  start 
and  end  timers  before  and  after  the  software  function  call  respectively  and  evaluating  over 
thousands  of  iterations.  The  software  function  call  is  then  replaced  with  a  hardware  “function 
call”  that  interfaces  the  algorithm  process  with  a  hardware  alternative  instead.  The  timing  for  this 
function  is  evaluated  in  the  same  way  the  software  is  evaluated. 

Establishing  an  embedded  system  for  testing  the  algorithm  can  be  a  difficult  feat  in  itself. 
While  the  Quartus  II  software  significantly  reduces  the  design  process,  understanding  the  dual 
data  rate  (DDR)  interface  and  the  USB  protocol  are  critical  to  successful  implementation. 
Additionally,  the  Nios  II  IDE  compiles  an  earlier  version  of  C++  than  what  the  existing 
Windows-based  console  application  C  code  is  written  and  thus  extensive  knowledge  of  C++  is 
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necessary  to  convert  the  C  code  into  code  that  can  be  compiled  for  the  Nios  II  processor.  This 
included  converting  the  Windows  C++  code  from  using  templates  to  macro  classes  for  the  Nios 
II  IDE.  These  three  challenges,  C++  conversion  of  the  original  source  code,  DDR2  interfacing 
and  USB  interfacing  are  the  core  of  this  project,  proving  the  feasibility  of  a  chip-based  iris 
recognition  system. 


2.4.1  Programming  the  RED  Algorithm  into  the  FPGA  with  C++ 

The  original  algorithm  code  was  written  for  MATLAB  and  later  converted  to  C++  for 
higher  throughput  research.  The  C++  code  uses  templates  in  the  class  declarations  which  tend  to 
be  a  more  advanced  coding  practice  only  supported  by  leading  edge  compilers  such  as  Visual 
Studio  by  Microsoft.  Classes  are  a  container  object  in  C++  for  building  more  abstract  data 
structures  such  as  a  queue,  matrix  or  image  within  memory.  Templates  are  a  convenient  way  to 
implement  a  class  by  declaring  the  datatypes  at  the  class  instantiation.  Thus,  templates  eliminate 
the  need  for  writing  a  separate  class  for  every  datatype  used.  Since  the  Nios  Integrated 
Development  Environment  (IDE)  does  not  support  templates  in  classes,  the  C++  templates 
needed  to  be  converted  to  classes  with  macros  and  copied  for  every  possible  instantiation  of  the 
class.  A  macro  is  similar  to  a  template  in  that  when  the  code  is  compiled,  the  compiler  replaces 
all  instances  of  a  certain  keyword  with  the  definition  of  that  keyword  such  as  an  integer  or  float 
datatype.  This  requires  significantly  larger  program  footprint  by  more  than  quadrupling  the 
amount  of  code.  Yet,  once  the  software  is  loaded  into  the  Nios  II  embedded  system,  the 
algorithm  sequential  processes  are  now  independent  of  a  host  machine  running  an  operating 
system  environment  such  as  Microsoft  Windows  or  Linux  which  can  be  a  significant  advantage 
for  portability  and  speed.  Because  some  changes  had  to  be  made  to  the  algorithm  code  during  the 
conversion  for  the  Nios  IDE  compiler,  the  RED  algorithm  software  is  actually  a  close  match  to 
the  original  software  and  produces  approximately  the  same  results  with  less  precision  in  some 
parts  of  the  algorithm  since  the  macro  based  classes  sometimes  used  less  precise  datatypes  in 
some  functions.  Some  datatypes  had  to  be  restricted  for  simplicity  and  reduced  from  long  to 
short  or  from  double  to  float.  These  changes  do  not  corrupt  the  algorithm  since  the  less  precise 
datatypes  do  not  alter  the  distributions  of  various  calculations  such  as  the  mean  or  standard 
deviation  used  throughout  the  software. 


2.4.2  Integrating  Advanced  Memory 

The  DDR2  interface  is  an  advanced  memory  interface  used  in  commercial  PCs  available 
everywhere.  Both  the  algorithm  code  and  the  space  needed  to  store  multiple  working  copies  of 
images  required  far  more  space  than  what  was  available  on  the  Stratix  III  FPGA.  This  required 
the  use  of  the  DE3  board’s  only  other  memory  option  -  the  DDR2  discrete  memory  module  on 
the  side  of  the  board.  Communicating  with  DDR2  requires  an  interface  because  the  Nios  II 
addressing  and  data  communication  is  much  different  that  the  addressing  and  data  of  the  DDR2. 
This  interface  requires  a  controller  to  convert  the  language  of  the  Nios  II  processor  into 
something  understood  by  the  DDR2  module.  An  in-depth  explanation  of  these  interfaces  is 
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unnecessary  and  beyond  the  scope  of  this  paper.  However,  interfacing  the  DDR2  proved 
exceptionally  challenging  and  created  many  problems  for  this  project. 

DDR2  requires  extremely  complex  and  precise  timing  that  caused  the  embedded  system 
to  be  meta-stable  and  only  meet  timing  within  tens  of  picoseconds.  Data  transfer  from  register  to 
register  is  stable  when  the  register  is  not  clocked  at  the  same  time  data  is  switching.  Narrow 
timing  margin  means  the  register  is  clocked  very  close  to  the  switching  of  the  data.  This  meta¬ 
stability  is  the  result  of  a  design  flaw  in  the  DE3  board  due  to  improper  routing  of  the  DDR2 
traces  and  physical  connections  with  the  Stratix  III,  forcing  the  internal  FPGA  system  to  cope 
with  poor  pin  locations  chosen  when  the  board  was  printed.  Timing  stability  in  the  picoseconds 
provides  only  extremely  narrow  margins  for  successful  operation  even  for  true  ASICs  and 
microprocessors  running  in  excess  of  gigahertz.  For  an  FPGA  which  has  lower  performance 
logic  than  these  other  devices  and  clock  frequencies  in  the  10’s  and  100’s  of  megahertz,  working 
with  stability  in  the  picoseconds  requires  other  parameters  of  operation  such  as  temperature  to 
possibly  be  as  narrow  as  tenths  of  degrees.  Temperature  directly  influences  the  silicon  die  of  the 
FPGA  and  higher  temperatures  create  more  unpredictable  timing  of  a  signal  through  longer 
length  routing.  Since  the  system  has  functioned  for  brief  periods  of  time,  temperature  seems  to 
be  a  possible  factor  contributing  to  the  successful  operation  of  the  algorithm  but  cannot  be 
controlled  on  the  DE3  board  at  the  necessary  precision.  Yet,  many  things  could  cause  timing 
instability  and  since  it  is  impossible  to  pear  inside  the  chip  the  actual  cause  of  the  system 
instability  is  unknown.  Since  the  software  is  being  run  from  the  DDR2  memory,  instability  either 
makes  the  system  very  difficult  to  program  or  the  program  to  halt  halfway  through  due  to  bad  or 
illegal  instructions  (software).  The  data  memory  is  less  susceptible  to  memory  instability  but 
errors  in  data  can  certainly  cause  issues  in  the  processing  of  images.  However,  this  was  never 
investigated  since  problems  with  memory  caused  failure  for  both  instructions  (software)  and 
data.  If  the  Nios  was  able  to  successfully  execute  the  entire  algorithm,  it  can  be  assumed  that  the 
data  was  stable  and  reliable  throughout  the  entire  execution. 


2.4.3  Building  Interfaces  with  the  DE3  Board 

Lastly,  the  other  challenging  development  for  the  embedded  system  was  the  USB 
interface.  The  embedded  system  does  not  have  a  camera  dedicated  to  capturing  images  and 
placing  them  into  memory  for  processing.  Thus,  some  means  of  getting  images  into  the  DE3 
board  was  necessary.  The  DE3  board  has  limited  interfacing,  with  only  USB  host  and  device 
ports  readily  available.  Again,  the  USB  protocol  is  exceptionally  complicated  and  beyond  the 
scope  of  this  paper.  To  bring  images  into  the  embedded  system,  driver  software  had  to  be  written 
for  the  Nios  II  microprocessor  to  talk  to  the  USB  port  as  well  as  driver  software  for  a  host  system 
(Microsoft  Windows  Vista  based)  to  detect  the  DE3  board  as  a  slave  device  and  transmit  images. 
Using  the  Microsoft  Developer  Network’s  WinDDK,  driver  software  was  written  to  parse  a 
bitmap  image  of  any  format  into  a  matrix  of  8-bit  values,  open  a  connection  with  the  DE3  board 
and  transmit  the  image  in  a  series  of  32  byte  packets.  Transmission  of  a  typical  640x480  VGA 
image  required  a  little  over  9600  packets  (1  byte  in  each  packet  acts  as  a  header)  and  took  only 
about  1  second.  Due  to  the  simplicity  of  the  driver  on  the  Nios  II  side,  another  image  cannot  be 
pushed  to  the  DE3  board  until  the  algorithm  is  fully  executed. 
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For  demonstration  purposes,  a  human  interface  was  developed  so  the  user  can  get 
feedback  of  the  system  operation.  The  DE3  board  is  equipped  with  LEDs  that  can  be  red,  green 
or  blue  or  any  combination  of  those  colors  for  each.  These  LEDs  are  initially  all  in  the  unlit  state. 
When  a  new  unrecognized  eye  is  loaded  onto  the  DE3  board,  the  system  sets  one  of  the  LEDs 
red  (up  to  8  possible  unique  eyes).  When  an  eye  that  is  recognized  as  a  previously  loaded  eye,  it 
will  light  the  corresponding  LED  to  that  eye  green  momentarily  and  back  to  red.  Images  of 
unrecognized  eyes  will  instead  light  a  new  LED  red. 


2.5  Bringing  Everything  Together 

The  fully  implemented  software  and  interfaces  now  allows  for  the  testing  of  hardware 
devices  integrated  into  the  embedded  system.  These  hardware  devices  can  be  parallel 
implementations  and  are  meant  to  ease  the  processing  burden  of  the  Nios  II  processor  by  pushing 
the  data  to  a  hardware  based  function  rather  than  using  the  original  software  functions.  The 
embedded  system  allows  for  testing  and  evaluation  of  the  hardware  based  functions  and 
comparisons  to  be  drawn  because  both  implementations  can  be  compared  on  the  same  system. 
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Chapter  3 


Binary  Morphology 

Computer  systems  which  this  paper  will  refer  to  as  general  purpose  systems  are 
electronic  machines  that  execute  instructions  known  as  software.  Many  computer  systems  have 
operating  systems  that  run  in  the  background  along  with  many  other  concurrent  processes  or 
programs  that  can  slow  the  general  purpose  system  down.  Iris  recognition  algorithms  are  usually 
written  in  C  code  and  compiled  into  an  application  that  the  operating  system  runs  among  other 
programs.  For  the  purpose  of  optimization,  the  iris  recognition  software  would  ideally  be  isolated 
from  the  parent  system  and  concurrent  processes.  This  could  allow  a  processor  to  focus  its  full 
throughput  on  the  execution  of  the  algorithm  software.  Furthermore,  general  purpose  systems  are 
not  especially  mobile  (with  the  exception  of  very  lightweight  laptop  computers)  and  thus  may 
not  serve  every  venue  where  iris  recognition  systems  could  be  deployed.  Thus,  a  hardware 
implementation  of  an  iris  recognition  system  is  especially  interesting  as  it  could  be  exceptionally 
faster  than  its  general  purpose  counterpart  while  also  being  small  enough  to  be  part  of  a  digital 
camera  or  camera  phone  with  sufficient  resolution  to  detect  fine  iris  features. 


3.1  Iris  Recognition  Segmentation 

Segmentation  searches  for  the  iris  within  an  image  of  an  eye.  For  template  generation  to 
be  successful,  the  segmentation  process  must  properly  extract  the  iris  from  the  image.  As 
humans,  we  have  an  exceptionally  powerful  image  processor  working  for  us  as  a  result  of 
millions  of  years  of  development  that  allows  our  brains  to  effortlessly  segment  out  objects  in  our 
field  of  view.  For  a  computer,  segmenting  objects  from  an  image  is  similar  to  trying  to  inspect  a 
photograph  by  looking  through  a  coffee  straw.  Computers  lack  the  ability  to  see  the  “big  picture” 
and  must  use  lengthy  and  complex  image  processing  functions  to  break  down  the  image. 

The  iris  is  bounded  by  the  pupil  at  the  inner  radius  and  the  limbic  boundary  at  the  outer 
radius.  Finding  these  boundaries  begins  with  finding  the  pupil  since  it  is  assumed  to  be  the 
easiest  part  of  the  image  to  segment  with  the  assumption  that  it  is  simply  a  large  black  circle 
somewhere  in  the  image.  To  identify  the  pupil,  the  original  image  undergoes  contrast  limited 
histogram  equalization  (CLAHE)  to  adjust  contrast  and  better  differentiate  the  difference 
between  minimum  and  maximum  values  across  the  image  before  thresholding  all  the  values.  All 
values  above  the  threshold  (typically  the  mean  plus  one  standard  deviation  of  all  pixels  in  the 
image)  are  assigned  a  value  of  ‘0’  and  all  values  below  are  assigned  a  ‘1.’  This  leaves  only  the 
pupil  and  other  very  dark  objects  such  as  some  eyelashes  and  perhaps  some  of  the  eyebrow. 
However,  due  to  the  nature  of  the  cornea  being  at  its  apex  right  above  the  pupil,  there  are  often 
glare  artifacts  directly  over  the  pupil  and  cause  the  pupil  to  be  obfuscated  and  irregular.  To 
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correct  for  artifacts  and  find  the  existence  of  objects  in  the  image,  the  segmentation  process  uses 
a  series  of  binary  morphology  functions  to  dilate,  erode  and  assign  values  to  the  objects  based  on 
area  and  perimeter. 

Once  the  pupil  is  detected  by  comparing  the  different  objects  in  the  image,  the  pupil’s 
center  is  used  as  a  starting  point  back  in  the  original  image  for  finding  the  limbic  boundary. 
Local  statistics  compute  the  kurtosis  of  3  by  3  windows  all  around  the  image.  The  algorithm  then 
picks  values  radially  from  around  the  area  of  the  limbic  boundary  and  accumulates  the  values. 
Segmentation  then  steps  left  and  right  of  the  pupil  center  recalculating  the  sum  of  the  local 
kurtosis.  The  smallest  value  is  most  likely  the  true  location  of  the  limbic  boundary  and  the  iris 
can  be  located  and  unwrapped  for  template  generation. 


3.2  Binary  Morphology 

Binary  morphology  functions  operate  on  the  image  after  it  has  been  thresholded  in  order 
to  find  the  existence  of  relevant  objects  such  as  the  pupil.  When  an  image  is  thresholded,  the 
pixel  values  in  the  image  are  reduced  to  a  single  bit  for  logical  operation.  Approximately  half  of 
the  algorithm’s  segmentation  execution  time  is  consumed  by  binary  morphology  functions.  The 
most  difficult  of  these  functions  are  the  dilation  and  erosion  functions  which  are  slow  even  when 
pre-processing  the  image  to  find  objects  for  dilation  and  erosion  to  operate  on.  Figure  3.1 
illustrates  a  single  step  dilation  and  erosion  and  how  the  resulting  object  is  roughly  the  same 
shape  but  without  the  artifact  in  the  center. 


Original  Dilation  Erosion 


Figure  3.1:  Binary  morphology  demonstration. 

Left:  image  before  dilation  and  erosion.  Middle:  one  step  of  dilation.  Right:  one  step  of 
erosion.  Note  that  the  artifact  (hole)  was  filled  in  the  final  image  but  the  general  shape  is 

retained. 
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Logically,  dilation  and  erosion  are  the  same  function.  Dilation  looks  at  all  the  ‘  1  ’  values 
in  the  image  and  places  a  ‘1’  in  every  immediately  adjacent  pixel  above,  below,  left  and  right. 
Similarly,  erosion  places  a  ‘0’  value  pixel  adjacent  to  every  ‘O’  pixel  in  the  image.  These 
functions  must  scan  all  the  relevant  pixels  and  make  non-linear  adjustments  around  the  image. 
Therefore,  binary  morphology  even  when  ideally  optimized  is  a  growth  function.  However,  with 
many  objects  and  many  pixels  in  the  image,  binary  morphology  can  become  increasingly  messy 
and  complex,  consuming  much  of  the  general  purpose  processor’s  attention. 


3.3  Methodology 

This  paper  will  take  the  binary  morphology  part  of  the  RED  algorithm  and  compare  a 
software  and  hardware  implementation  of  the  dilation  and  erosion  functions  in  order  to 
demonstrate  the  effectiveness  of  a  hardware  equivalent  function.  Hardware  based  segmentation 
could  be  faster,  slower  or  more  difficult  depending  on  the  function  being  implemented.  These 
functions  are  directly  comparable  to  other  standard  software  based  image  processing  functions. 
Also,  there  are  many  opportunities  for  alternate  implementation  in  the  algorithm:  implementation 
of  the  binary  morphology  dilation  and  erosion  functions  are  just  a  couple  of  many  possible 
functions  that  can  be  implemented.  In  order  to  properly  implement  the  hardware  equivalent,  we 
investigate  the  software  implementation  of  dilation  and  erosion  in  order  to  understand  how  the 
logical  function  should  behave  in  hardware. 

To  properly  evaluate  the  conversion  from  software  to  hardware,  a  test  system  needs  to  be 
developed  to  execute  the  entire  algorithm  in  software  on  the  target  device.  The  FPGA  is 
programmed  with  a  test  system  that  can  execute  C  code  and  the  details  of  this  implementation 
are  beyond  the  scope  of  this  paper.  The  FPGA  system  is  different  than  a  general  purpose  system 
because  the  processor  on  the  FPGA  runs  only  the  algorithm  code  and  no  other  processes  but  at  a 
slower  clock.  Additionally,  the  speed  of  the  FPGA  system  is  generally  slower  than  a  general 
purpose  system.  This  FPGA  system  is  fed  an  image  of  an  eye  where  the  segmentation,  template 
generation  and  template  matching  processes  begin  just  as  they  would  on  a  general  purpose 
system.  Note  that  the  programming  of  an  FPGA  is  different  from  the  programming  of  software 
into  a  computer.  This  programming  defines  the  interconnection  of  basic  logic  elements  inside  the 
FPGA  so  that  the  desired  logic  (processor,  memory  controller  and  test  hardware  device)  is 
realized.  However,  unlike  the  general  purpose  system,  the  FPGA  system  has  the  ability  to  call  on 
a  specially  designed  peripheral  hardware  component  inside  the  FPGA  to  process  data  and  can 
read  the  result  back  almost  seamlessly. 

Upon  converting  the  function  from  software  to  hardware,  the  software  based  function  call 
is  followed  by  a  new  function  that  calls  on  the  hardware  device  for  data  processing.  The 
functions  under  test  are  isolated  in  the  algorithm  by  placing  reference  points  before  and  after  the 
function  call  that  records  start  and  end  times.  The  output  of  the  software  function  can  be  used  to 
assert  the  valid  output  of  the  hardware  based  function  and  the  execution  times  can  be  compared. 
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3.4  Software  implementation 

The  software  version  of  the  algorithm  preprocesses  the  image  by  scanning  the  image  for 
objects  that  have  an  area  that  is  at  least  the  minimum  area  for  the  pupil.  These  objects  are 
individually  passed  into  the  dilate  and  erosion  functions  for  processing  in  an  effort  to  avoid 
processing  the  entire  image  in  a  brute  force  function.  The  software  function  calls  appear  in  the 
algorithm  as  shown  in  Tables  3.1  and  3.2. 


Table  3.1:  Binary  morphology  function  calls 
from  FindPupil  function. 


1 abel =Th  reshol dUlNTl6lmage (i m , 
(UlNTl6)threshval ) ; 

areas=l abet ->BWLabel (MIN_PUPIL_AREA) ; 

bwobjs=ConvertLabel lmageToObjects(l  abel , 
areas)  ; 

for  (i=0;i<bwobjs->Nobj  ;i++)  { 

BWDil atelnPl ace(bwobjs->obj [1 ] -> 
data,sel);  //  Dilate 
BWErodelnPl acefbwobj s->obj [i  ]  -> 
data,se2);  //  Erode 


Table  3.2:  BWDilatelnPlace  function 
definition.  Note  that  the  return  of  the  image 
is  inverted 


void  BWDilatelnPlace(lmage  *in,  Image  *se) 


for  (i =1 ;  i <in->Actual Rows-1;  i++) 
for  C j  =1 ; j <i n->ActualCol s-1; j++) 
if  (in->matrix[i]  [j]) 

i  nl-> 

matrix[i+offset]  [j+offset]=l; 
i f  ((i n->matrix [i -1]  [j]==0)  |  | 
(in->matrix[i+l]  [j]==0)  |  | 

(in->matrix[i]  [i-l]==0)  |  | 

(in->matrix[i]  [j+l]==0)) 

i of f set=i +of f set ; 
joffset=j+offset ; 
for (i i =0 ; i i <numones ;  i  i  ++) 

{ 

i  nl->matrix[ioffset+ 
rowi  ndex[i  i  ]  ] 

[joffset+col  i  ndex[i  i  ]  ]  =1 ; 

}; 

}; 


The  software  implementation  uses  three  FOR  loops  to  step  through  the  objects  in  the 
image  and  perform  dilate/erode  processing.  Each  function  call  steps  through  all  relevant  pixels 
and  assigns  the  appropriate  bit  based  on  adjacent  black  and  white  values.  Additionally,  each 
function  must  be  called  15  times  for  a  total  of  15  dilates  and  15  erodes  to  fully  process  each 
object  in  the  image.  With  a  large  image  that  has  many  objects  covering  enough  area  to  possibly 
be  the  pupil,  the  functions  can  take  a  lot  of  time  to  process.  Next,  we  will  investigate  a  hardware 
alternative  to  these  functions. 

The  business  portion  of  the  software  implementation  is  the  looping  within  each  binary 
morphology  functions.  These  loops  use  basic  logical  ‘or’  of  the  neighboring  pixel  above,  below, 
left  and  right  to  determine  each  bit  assignment.  These  loops  are  also  the  most  time  consuming 


26 

since  they  are  the  highest  order  nest  in  the  function  calls.  A  hardware  implementation  could 
reduce  these  loops  to  a  growth  function. 

The  software  implementation  takes  into  account  edge  effects  by  padding  the  outside 
edges  of  the  image  with  zeros.  The  borders  of  the  image  complicate  the  dilation  and  erosion 
functions  since  pixels  on  the  far  outside  of  image  will  reference  pixels  not  defined  within  the 
rows  and  columns  of  the  image.  Instead,  these  areas  are  fdled  with  null  values  so  that  they  do  not 
contribute  to  the  dilation  or  erosion  of  objects  very  close  to  the  borders. 

Another  advantage  with  hardware  is  that  a  hardware  implementation  takes  advantage  of 
replication  to  speed  up  processing  power.  The  preprocessing  of  the  image  to  find  objects  before 
dilation  and  erosion  is  not  necessary  to  increase  speed.  Instead,  the  initial  object  detection  of  the 
algorithm  will  be  moved  after  dilating  and  eroding  all  objects  in  the  image  simultaneously. 
Theoretically,  this  reordering  of  operations  should  not  affect  the  output. 

3.5  Hardware  Implementation  (VHDL) 

Hardware  design  is  inferred  similar  to  the  way  software  is  coded  using  special 
programming  languages  such  as  Verilog  Hardware  Description  Language  (Verilog  HDL)  or 
Very  high  speed  integrated  circuit  Hardware  Description  Language  (VHDL).  Designs  are 
compiled  using  a  synthesizer  and  fitter  that  build  a  netlist  describing  how  the  fundamental  logic 
elements  within  the  FPGA  are  wired  together. 

The  primary  function  to  be  implemented  is  the  logical  ‘or’  function  that  we  identified 
from  the  software  function  loops.  These  ‘or’  functions  are  pixel  oriented  and  directly  translate  to 
an  elementary  unit  in  hardware  that  we’ll  call  a  BMElement  (BM  for  Binary  Morphology) 
which  represents  the  pixel  being  operated  on.  The  value  stored  within  this  BM  Element  is  driven 
by  similar  hardware  elements  holding  values  for  pixels  adjacent  to  the  pixel  stored  in  this 
BM  Element.  Thus,  these  BMElements  are  easiest  to  imagine  as  though  they  were  arranged  in 
the  ‘shape’  of  anmxn  image.  The  logic  for  these  BMElements  also  takes  into  account  both  the 
dilation  and  erosion  since  they  are  complimentary  binary  logic  functions. 

The  coding  procedure  for  this  hardware  is  to  instantiate  a  full  dilation  and  erosion 
hardware  entity  called  BMDilateErode  that  will  receive  the  pixel  data  from  the  processor  one 
pixel  at  a  time  (with  a  width  of  only  a  single  bit  for  ‘1’  or  ‘0’)  and  transmit  the  resulting  pixel 
data  back  to  the  processor  one  pixel  at  a  time.  BM  Dilate  Erode  will  act  as  a  three-state  state 
machine  that  receives  the  pixel  data,  processes  the  data  and  then  sends  the  data  back.  Within  the 
overarching  BM  Dilate  Erode  instantiation  are  the  individual  BM  Element  blocks  and  an 
interconnect  fabric  that  allows  for  pixel  data  to  be  loaded  into  each  BM  Element  and  for  each 
BM  Element  exchange  relevant  pixel  data  among  each  other.  Data  enters  the  BM  Dilate  Erode 
device  as  a  shift  register  as  shown  in  Figure  3.2. 


Data  Out 


Figure  3.2:  Binary  morphology  data  input  train.  The  input  shifts  the  data  into  the  elements 
starting  at  the  first  and  ending  at  the  last  BMElement  represented  by  the  labeled  “E” 
blocks.  The  processor  shifts  the  last  pixel  of  the  thresholded  image  in  first  and  when 
processing  is  complete  the  data  is  shifted  out  last  pixel  first. 


The  interconnect  fabric  within  the  BMDilateErode  machine  is  the  most  challenging 
part  of  the  device.  Unlike  the  software  implementation,  this  fabric  cannot  account  for  edge 
effects  using  the  technique  of  padding  (which  places  null  values  around  the  edges).  Instead,  it 
circularly  connects  the  top  and  bottom,  left  and  right  elements  together  so  that  these  connections 
are  not  undefined.  The  effects  of  these  circular  connections  are  negligible  since  a  segmentable 
pupil  will  be  located  somewhere  within  the  image  borders  and  ideally  not  partially  cut  off  at  the 
edge  of  the  image.  The  interconnect  fabric  is  actually  a  long  string  of  “nodes”  numbered  from 
zero  to  the  total  number  of  BMElements  within  the  device.  These  nodes  represent  the  connect 
points  between  different  BM  Elements  and  each  BM  Element  references  adjacent  elements  by 
their  numbered  connection  in  this  fabric.  This  is  illustrated  in  Figure  3.3. 


Element  Array  Local  Interconnect 


in 


E+1 


Figure  3.3:  The  binary  morphology  local  interconnect  fabric  is  a  series  of  nodes  labeled 
from  zero  to  the  total  number  of  BM  Elements  in  the  BM  Dilate  Erode  device.  Each 
BM  Element  inputs  the  adjacent  pixel  values  held  by  neighboring  BM  Elements  by 
referencing  neighboring  nodes  in  the  fabric  and  outputs  the  pixel  value  held  within  the 
BM  Element  to  its  associated  interconnect  fabric  node. 
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Each  BMElement  computes  the  dilation  and  erosion  using  only  a  few  lines  of  code  as 
shown  in  Table  3.3  and  produces  the  logic  blocks  seen  in  Figure  3.4.  Like  a  function  call  in 
software,  the  BM  Element  encapsulates  recurring  code  into  a  single  logic  block  with  finite 
inputs  and  outputs.  These  inputs  reference  the  neighboring  four  nodes  and  necessary  control 
signals.  It  outputs  its  value  back  into  the  local  interconnect  fabric  node  allotted  for  that 
BM  Element.  This  is  accomplished  using  the  code  shown  in  Table  3.4.  The  hardware  operation 
is  demonstrated  in  Figure  3.5. 


Table  3.3:  BM  Element  functional  logic 


process (aload,  adatain,  elk) 
begi  n 

if  (aload  =  '1')  then 

adataout  <=  pixel; 
pixel_step  <=  adatain; 

el  se 

if  (ri  si  ng_edge(cl k))  then 

i f (cl k_enabl e  =  '1')  then 

if (operation  =  ’O')  then  —  Dilate 

pixel_step  <=topmid  or  midleft  or  midright  or 
bottommi d ; 

el  se 

pixel_step  <=  not(topmid  or  midleft  or  midright  or 
bottommi d) ; 

end  if; 

process(cl  k) 
begi  n 

i f (ri si ng_edge(cl k))  then 

if (operation  =  '0')  then 
output  <=  pixel; 

el  se 

output  <=  not(pixel); 

end  if; 

end  if; 
end  process; 
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Figure  3.4:  Internal  logic  of  one  BMElement  block  inferred  from  the  code  seen  in  Table 
3.3.  The  device  takes  in  4  inputs  from  the  local  interconnect  and  ‘or’s  the  logic  together. 
This  value  is  then  inverted  depending  on  the  operation  (dilate  or  erode)  and  latched  at  the 
pixel  step  flip-flop.  There  is  also  control  logic  to  shift  the  value  in  this  BM  Element  on  to 
the  next  BM  Element  and  input  the  value  from  the  previous  BM  Element  in  the  chain  as 

demonstrated  in  Figure  3.2. 


Table  3.4:  BM  Element  instantiation  and  interconnect  logic 


signal  1 ocal_i nterconnect  :  std_logic_vector(ROWS*COLS  downto  0) ; 


BM_rows:  for  j  in  0  to  ROWS  -  1  generate 
begi  n 

BM_cols:  for  i  in  0  to  COLS  -  1  generate 
begi  n 

Ul:  BM_Element 
port  map( 

elk  =>  avs_sl_clk, 
clk_enable  =>  clk_e, 
operation  =>  operation, 
aload  =>  shift_load, 
adatain  =>  local_data(j*COLS  +  i), 
adataout  =>  local_data(j*COLS  +  i  +1), 

topmid  =>  local_interconnect(((j  +  ROWS  -  1)  mod  ROWS)*COLS  +  ((i  + 
cols)  mod  cols)  ) , 

midleft  =>  local_interconnect((j*COLS)  +  ((i  +  (COLS-l))  mod  COLS)), 
midright  =>  local_interconnect((j*COLS)  +  ((i  +  (COLS+l))  mod  COLS)), 
bottommid  =>  local_interconnect(((j  +  ROWS  +  1)  mod  ROWS)*COLS  +  ( (i 
+  cols)  mod  cols)  ) , 

output  =>  local_interconnect(  j*COLS  +  i) 

)  l 

end  generate; 


end  generate; 
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Figure  3.5:  Dilation  and  erosion  in  hardware.  For  dilation,  the  hardware  performs  logic 
‘OR’  on  all  pixel’s  neighboring  pixels  in  parallel  and  if  any  of  them  are  black  the  particular 
pixel  in  question  becomes  black.  For  erosion,  the  inputs  and  outputs  are  inverted  and  thus 
performs  the  same  logical  ‘OR’  but  on  white  pixels. 


Once  the  entire  design  is  put  together,  it  must  be  incorporated  into  the  overall  FPGA 
system  so  that  the  performance  of  the  hardware  design  can  be  evaluated  relative  to  the  original 
software  design.  In  this  system,  the  design  is  included  as  a  peripheral  component  to  the 
microprocessor  in  the  system  and  the  compiler  provides  the  microprocessor  with  a  pointer  to  the 
input  of  the  BM  Dilate  Erode  device.  The  algorithm  software  code  is  rewritten  to  send  the 
entire  image  for  dilation  and  erosion  and  the  result  is  read  back  for  object  detection.  This 
peripheral  totally  eliminates  the  dilation  and  erosion  binary  morphology  functions  from  the 
algorithm  in  exchange  for  simple  copy  functions  that  move  data  from  DDR2  SDRAM  into  the 
peripheral  device  and  back  out. 
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Chapter  4 
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Figure  4.1:  Capturing  the  image  of  an  eye  and  storing  the  result  as  pixel  data 

Iris  recognition  biometrics  measure  the  human  eye  as  though  it  were  a  digital  signal 
conveying  a  message  of  identity.  The  signal  is  transmitted  from  the  iris,  the  eye’s  colored  region 
around  the  black  pupil,  visually  via  infrared  radiation  and  captured  using  a  camera,  storing  the 
signal  as  digital  ones  and  zeroes  (picture  elements  or  pixel  data)  as  seen  in  Figure  4.1.  Since 
pixel  data  is  nearly  meaningless  to  a  computer,  digital  signal  processing  must  be  performed  to 
extract  meaningful  information  from  ones  and  zeroes.  Biometric  algorithms  depend  on  digital 
filtering,  a  kind  of  digital  signal  processing,  to  transform  simple  pixel  data  into  meaningful  bits 
used  to  match  a  person’s  eye  with  a  stored  digital  counterpart.  In  current  recognition  systems,  a 
general  purpose  processor  follows  instructions,  also  known  as  software,  to  perform  the  digital 
filtering  a  single  step  at  a  time.  At  high  clock  frequencies,  software  driven  filtering  is  not  as 
much  a  troubling  issue;  however,  when  migrating  this  or  any  digital  filtering  to  hardware  such  as 
a  field  programmable  gate  array  (FPGA),  slower  clock  frequencies  manifest  a  challenge  that 
otherwise  may  not  be  a  problem.  FPGAs  can  mitigate  having  a  slow  clock  by  maximizing 
efficiency  in  highly  application  specific  hardware  design.  For  an  iris  recognition  system, 
specially  tailored  parallel  digital  filters  can  help. 

4. 1  Filtering  in  Hardware 

Parallel  filtering  can  take  on  many  different  forms  and  an  engineer  may  choose  the  form 
of  parallelization  that  best  fits  the  needs  of  the  application.  In  any  parallel  digital  operation, 
memory  is  a  key  factor  in  design  since  it  is  often  either  a  bottleneck  or  an  FPGA  device  resource 
hog.  Fleavy  parallelization  in  hardware  will  invariably  require  more  memory  to  support 
simultaneous  reading  and  writing.  Iris  recognition  algorithms  typically  use  nine  by  nine  (9x9) 
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filters  which  can  rapidly  consume  an  FPGA’s  available  RAM  resources  needed  to  keep  track  of 
many  parallel  intermediate  steps.  Therefore,  the  form  of  parallelization  used  in  this  design  seeks 
to  maximize  the  amount  of  parallel  computations  while  keeping  the  memory  needed  for 
intermediate  computations  optimized  and  at  an  absolute  minimum. 


Input  Signal 
9(x) 


Periodic  (Circular)  Input  Signal 

9n(x) 


Filter 

h(x) 


Area  of  intersection  increases 
Convolution  as  signal  moves  through  filter 


f(x)  =  gn(x)  *  h(x) 


Figure  4.2:  Circular  convolution  in  one  dimension 

Before  filtering,  the  inner  and  outer  boundaries  of  the  iris  were  identified.  The  Ridge 
Energy  Direction  (RED)  Algorithm  divides  the  iris  into  at  least  16  concentric  circles  and  128 
radial  lines.  The  iris  is  then  “unwrapped”  by  picking  off  values  from  the  original  image  at  each 
intersection  of  radial  line  and  concentric  circle  and  recording  them  value  in  a  2-D  matrix  of 
exclusively  iris  pixels.  This  process  is  likened  to  converting  the  iris  from  polar  coordinates  to 
Cartesian  coordinates. 

The  RED  Algorithm  uses  finite  impulse  response  (FIR)  filtering  to  evaluate  the 
information  content  of  an  iris.  FIR  filtering  is  the  type  of  filtering  done  by  a  digital  system  in  the 
time  domain  within  a  finite  set  of  coefficients.  The  algorithm  uses  two  9  by  9  (9x9)  filters  to 
accentuate  the  vertical  and  horizontal  features  separately.  A  9x9  filter  will  be  able  to  emphasize 
the  existence  of  ridges  as  wide  as  the  filter  dimension.  This  assumes  that  most  ridges  are  features 
small  enough  to  be  detected  by  the  9x9  filter.  Features  that  are  too  large  will  essentially  be 
ignored  by  the  filters  and  may  result  in  a  mask  bit  modulating  out  that  part  of  the  template  if 
neither  filter  can  determine  vertical  or  horizontal  ridges.  Furthermore,  these  filters  are  unaffected 
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by  the  original  resolution  of  the  image  since  the  density  of  pixels  (number  of  pixels  per  feature) 
are  determined  directly  by  the  choice  number  of  concentric  annuli  and  radial  lines.  Higher  or 
lower  resolution  segmented  irises  will  always  result  in  the  same  density  of  pixels  per  feature 
since  the  number  of  points  chosen  to  transform  from  polar  coordinates  in  the  original  image  to 
Cartesian  is  constant.  The  output  of  each  filter  is  compared  and  for  each  pixel,  a  ‘  1  ’  is  assigned 
for  strong  vertical  content  or  a  ‘0’  for  strong  horizontal  content.  These  bits  are  concatenated  to 
typically  form  a  2048  bit  vector  unique  to  the  “iris  signal”  that  conveyed  the  identifiable 
information. 

The  RED  iris  recognition  algorithm  uses  a  straightforward  circular  convolution  digital 
filter  similar  to  Figure  4.2  to  generate  the  matchable  iris  template,  a  set  of  bits  that  meaningfully 
represents  a  person’s  iris.  The  energy  data  passed  from  the  iris  segmentation  process,  the  process 
that  seeks  and  extracts  the  iris  from  an  image  of  a  person’s  eye,  is  organized  into  a  rectangular 
array  that  is  made  periodic  in  both  the  vertical  and  horizontal  dimensions  to  account  for  edge 
effects.  The  filter  passes  over  this  periodic  array  taking  in  8 1  values  at  a  time  and  computing  the 
result  for  the  filter,  storing  the  result  in  a  memory  location  that  corresponds  to  the  centroid  of  the 
filter.  The  digital  filter  continues  to  take  in  values  in  this  manner,  stepping  right  column  by 
column  and  down  row  by  row  until  the  filtering  is  complete  as  shown  in  Figure  4.3.  Finally,  the 
template  is  generated  by  comparing  the  results  of  two  different  digital  filters  (horizontal  and 
vertical)  and  writing  a  single  bit  that  represents  the  filter  with  the  highest  output  at  the  equivalent 
location. 
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Figure  4.3:  9x9  digital  filter  computing  the  circular  convolution  of  the  top  left  portion  of 
hypothetical  energy  data.  In  this  instance,  each  coefficient  of  the  filter  is  multiplied  by  the 
corresponding  energy  data  within  the  scope  of  the  filter  (filter  kernel)  where  some  of  the 
data  is  repeated  from  the  opposing  sides.  These  filter  coefficient  and  energy  data  products 
make  up  a  partial  result,  the  sums  of  which  generate  a  local  result  corresponding  to  the 

centroid  of  the  filter. 


4. 2  Problem  A  ddressed 
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1.  Software  Based  Sequential 


Figure  4.4:  Illustration  of  different  possible  digital  filtering  methods. 

There  are  three  outstanding  methods  to  compute  a  digital  filter  as  shown  in  Figure  4.4. 
The  first  method  uses  a  general  purpose  processor  to  compute  the  digital  filter  via  software, 
executing  a  single  instruction  at  a  time.  This  approach  is  acceptable  if  the  digital  filter  is  small; 
however,  as  the  digital  filter  dimension  increases,  the  computational  load  rises  exponentially. 
The  second  approach  is  to  develop  hardware  specifically  designed  to  compute  the  digital  filter 
but  by  first  loading  all  the  data  into  an  array  of  computing  elements.  By  loading  the  data  first 
(buffering  the  data  into  the  hardware),  this  approach  is  essentially  identical  to  the  sequential 
general  purpose  processor  method  due  to  overhead.  Only  partial  convolutions  can  be  computed 
until  the  entire  kernel  is  filled.  The  third  and  final  approach  is  to  redefine  the  convolution  as  a 
parallel  process,  allowing  the  hardware  to  take  in  a  single  byte  of  data  and  compute  the 
maximum  influence  that  data  would  have  on  the  final  results.  The  method  is  a  truly  parallel 
process  and  minimizes  the  need  for  overhead  and  buffering  that  other  sequential  processes  use. 
Throughout  this  paper,  the  parallel  method  of  approach  will  be  investigated  and  compared  to 
sequential  methods. 

Parallelizing  the  RED  algorithm  filtering  can  be  done  simply  by  doing  the  obvious  - 
calculating  the  result  of  both  filters  at  the  same  time  and  computing  the  single  bit  representation 
in  the  template  simultaneously.  Unfortunately,  while  this  straightforward  approach  appears  to  cut 
out  a  large  amount  of  intermediate  memory  it  in  fact  suffers  from  a  very  serious  memory 
bottleneck.  The  energy  data  to  compute  a  single  convolution  result  must  first  be  fully  loaded  into 
a  hardware  representation  of  each  filter.  Assuming  the  filters  could  compute  the  final  template 
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bit  in  a  single  clock  cycle,  each  filter  must  read  81  different  8-bit  values  of  energy  data  before 
any  fdtering  can  be  done.  Therefore,  a  straightforward  approach  to  convolution  based  digital 
filtering  in  hardware  will  suffer  due  to  the  overhead  associated  with  waiting  for  data  to  be 
localized  near  the  computing  elements.  In  this  case,  since  the  computing  elements  are  idle  until 
the  81  data  values  are  loaded,  the  process  is  fundamentally  sequential  much  like  the  general 
purpose  processor.  Consequently,  a  straightforward  convolution  in  hardware  is  inherently  not 
parallel  with  a  single  bank  of  memory  holding  all  the  energy  data. 
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Figure  4.5:  Possible  considerations  for  sequential  hardware  method 


To  parallelize  the  digital  filter  convolution,  the  memory  itself  must  be  parallelized.  A 
possible  but  impractical  option  to  parallelize  the  straightforward  digital  filter  convolution  would 
be  to  create  an  8 1  ported  block  of  memory  or  8 1  parallel  blocks  of  memory  that  could  supply  all 
energy  data  to  both  filters  in  a  single  step  as  seen  in  Figure  4.5.  Either  case  would  exceed  the 
memory  resources  of  even  the  most  massive  FPGAs  while  also  potentially  creating  interconnect 
issues.  Furthermore,  even  if  the  memory  issue  was  inconsequential  to  the  straightforward  circular 
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convolution,  there  is  no  way  around  the  overhead  associated  with  needing  at  least  8 1  energy  data 
values  and  at  least  9  more  properly  arranged  to  step  in  either  the  horizontal  or  vertical  directions. 


4.3  Solution 

There  is  an  alternative  way  to  accomplish  the  circular  convolution  that  also  offers  greater 
opportunity  for  parallelization  while  also  slightly  more  memory  friendly  in  reversing  the  digital 
filter.  Convolution  is  a  commutative  mathematical  operation  of  the  form  in  equations  4.1  and  4.2 


f(x)=g(x)*h(x)  or  f(x)=h(x)*g(x)  (4-1) 

where 


y[n)  =  h(n)  * x(n)  =2_Jh(n)  ■  x(n-m)  (4.2) 

m 


where  the  resulting  function  “f  ’  is  either  function  “g”  convolved  with  “h”  or  “h”  convolved  with 
“g.”  In  terms  of  digital  filtering,  this  is  the  same  as  either  convolving  the  filter  with  the  input  or 
convolving  the  input  with  the  fdter.  This  concept  is  critical  in  modifying  the  digital  filtering 
algorithm  so  that  it  can  more  easily  be  implemented  in  hardware.  Typically,  the  fdter  is 
convolved  “overtop”  the  input  data  but  for  parallelizing  purposes  the  input  data  will  be 
convolved  over  the  centroid  of  the  filter.  The  commutative  property  is  important  in  developing 
the  necessary  decomposed  algorithm  for  parallel  computations.  The  decomposed  algorithm  is  a 
function  of  the  accumulated  convolution  and  the  partial  contribution  of  a  single  sample  on  the 
final  output  as  seen  in  equation  4.3.  In  the  decomposed  algorithm  equation,  h(n)  represents  the 
FIR  filter  coefficients,  y  is  the  running  accumulation  and  Y  is  the  next  accumulation  after 
summation.  Notice  that  for  any  m  value,  the  input  x(n)  sample  is  shared  among  all  filter 
coefficients. 


Y  =  y  +  h(  n  —  m )  ■  x(  n) 


(4.3) 


This  approach  is  analogous  to  dropping  the  energy  data  onto  the  centroid  of  the  filter  and 
computing  the  partial  product  of  all  results  within  scope  as  seen  in  Figures  4.6  and  4.7. 
Commuting  the  convolution  eliminates  the  overhead  associated  with  buffering  all  the  energy  data 
into  the  filter  to  calculate  a  single  result  and  greatly  simplifies  the  hardware  implementation. 
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Filter  Kernel  Result  Array 


Figure  4.6:  Parallel  filtering  concept 


Figure  4.7:  Commutated  circular  convolution  example. 
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The  parallelized  digital  filters  maximize  the  computations  for  any  single  8-bits  of  energy 
data  so  that  input  buffering  is  kept  at  an  absolute  minimum  while  also  offering  quite  a  bit  of 
freedom  with  regard  to  the  form  of  the  resulting  data.  Since  the  RED  algorithm  calls  for  a 
circular  convolution,  the  parallelized  digital  filter  computes  the  partial  products  and  writes  the 
results  to  memory  into  the  appropriately  wrapped  memory  locations;  therefore,  more  complex 
convolutions  can  be  handled  simply  by  directing  the  partial  products  to  the  appropriate  memory 
locations.  While  outside  the  scope  of  this  paper,  it  is  likely  possible  to  compute  the  partial  results 
directly  into  the  finalized  template,  totally  eliminating  the  need  for  intermediate  memory  but  that 
is  left  to  the  discretion  of  the  engineer  to  handle  the  filters  in  whatever  way  best  fits  the 
application. 


4.4  TECHNICAL  DISCUSSION 


The  RED  algorithm  digital  filtering  is  designed  and  implemented  into  hardware  (an 
FPGA)  using  Very  Eligh  Speed  Integrated  Circuit  Hardware  Description  Language  (VHDL)  and 
Altera  Quartus  11  web  edition  design  software.  An  FPGA  can  be  programmed  with  a  hardware 
design  and  logically  behave  like  real  hardware.  VHDL  allows  the  engineer  to  describe  the  design 
of  the  system  and  implement  it  onto  the  FPGA  as  a  substrate  for  real-world  testing.  The  Altera 
device  targeted  in  the  design  process  is  the  Stratix  111  FPGA  where  the  digital  filters  consume 
about  6%  of  the  average  Stratix  111  device  resources  and  approximately  4%  of  the  device’s 
available  on-chip  RAM  blocks  (also  known  as  M9K  blocks). 


Vertical  Filter  Controller 


Input  Energy 
Data 


1  Vertical  Filter 

I  Array 

- 

Vertical  Filter 
Memory  Banks 

1  Horizontal  Filter 
|  Array 

Horizontal  Filter 
Memory  Banks 

Horizontal  Filter  Controller 


^  Comparator 
Module 

IMite  to  Final 
Template 


Figure  4.8:  System  block  diagram 


The  parallel  hardware  filter  consists  of  8 1  accumulator  cores  arranged  in  a  nine  by  nine 
array  to  represent  the  original  filter.  The  accumulators  are  grouped  within  the  array  by  row  since 
the  energy  data  from  the  segmentation  process  (or  most  other  common  process)  emerges  left  to 
right,  top  to  bottom,  easing  the  memory  access  as  the  filter  array  steps  through  the  incoming 
data.  Each  row  is  accommodated  with  a  block  of  simple  dual  ported  ram  to  support  all  the 
simultaneous  reading  and  writing  needs  of  the  accumulator  core  row.  Also  among  each  row  is  a 
multiplexer  to  narrow  the  control  of  memory  to  a  single  accumulator  core  and  a  row  controller  to 
grant  any  accumulator  within  the  row  memory  control  and  issue  the  proper  control  signals  to  the 
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row’s  accumulator  cores.  Finally,  the  array  is  fitted  with  a  first-in-first-out  (FIFO)  memory 
buffer  to  ease  incoming  data  and  a  comparator  module  that  judges  which  filter  had  the  highest 
output  and  issues  the  appropriate  bit  to  the  finalized  template  for  matching.  The  full  FIR  filtering 
system  is  shown  in  Figure  4.8. 

The  accumulator  cores  are  the  meat  of  the  parallelized  filter  and  are  responsible  for 
applying  the  filter  coefficients  to  the  incoming  data  and  tracking  the  accumulated  partial  results. 
Each  accumulator  core  is  a  finite  state  machine  (a  basic  type  of  logic  device)  that  is  driven  by 
both  outside  control  signals  and  internal  control  signals.  As  8-bit  energy  data  becomes  available 
from  the  FIFO  buffer,  an  external  “go”  signal  instructs  the  accumulator  core  to  apply  one  of  two 
coefficients  to  the  data  and  add  it  to  the  contents  of  a  16-bit  wide  accumulation  register.  The 
accumulator  cores  also  track  where  the  partial  results  are  stored  in  the  local  bank  of  memory  and 
as  the  energy  data  and  filter  steps  right  1  byte  at  a  time,  one  accumulator  core  (the  left-most  in 
the  filter  scope)  will  go  out  of  scope  and  need  to  exchange  its  accumulation  register  with 
memory  to  come  back  into  scope.  The  address  pointers  for  reading  and  writing  to  memory  are 
updated  at  the  end  of  the  last  accumulation  cycle  that  was  in  scope  and  new  data  is  read  in  and 
accumulated  at  the  start  of  the  next  accumulation  cycle.  Since  data  memory  must  stabilize  before 
being  read,  the  accumulator  cores  cycle  between  an  accumulation  state  and  an  idle  state  and  thus 
requiring  only  two  cycles  to  perform  all  partial  results  of  the  filter.  However,  a  special  case  exists 
when  the  accumulator  core  array  must  step  down  and  return  as  the  segmented  energy  data  moves 
onto  the  next  row  (much  like  the  carriage  return  of  a  typewriter).  During  a  step  down  operation, 
all  accumulators  in  all  rows  must  exchange  data  with  local  row  memory  which  requires  3  cycles 
per  accumulator  (2 1  cycles  total)  which  halts  filter  processing  briefly.  A  brief  block  diagram  of 
the  accumulator  state  machine  is  shown  in  Figure  4.9. 
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Figure  4.9:  Accumulator  core  state  machine 

The  accumulator  cores  are  managed  by  a  simple  driver  driven  by  logic  lookup  tables 
(LUTs)  and  arranged  into  rows  to  maximize  efficiency  and  minimize  the  amount  of  memory 
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used  by  the  filters.  In  each  step  right  operation,  only  the  left  most  accumulator  in  the  scope  (not 
always  the  physical  left  most)  will  save  the  value  in  its  accumulation  register  and  address  a  read 
on  the  data  value  located  in  memory  of  the  next  column  to  come  into  scope.  The  row  controller 
selects  the  appropriate  data  and  address  busses  so  that  the  accumulator  reading  and  writing  data 
has  control  of  the  memory.  The  row  controller  also  handles  the  go  signals  for  the  accumulator 
cores  (informing  the  accumulator  core  that  energy  data  is  stable  coming  from  the  FIFO  buffer) 
and  the  coefficient  select  signals  in  order  to  make  modifying  and  customizing  the  filter  operation 
simpler  and  centralized. 

Once  the  filtering  is  complete,  the  comparator  module  evaluates  the  result  of  each  filter 
and  compiles  the  final  template  ready  for  matching.  The  module  subtracts  the  horizontal  filter 
results  from  the  vertical  filter  at  each  corresponding  memory  location  in  all  row  bank  memories 
simultaneously  and  detects  overflow.  Depending  on  which  filter’s  output  was  highest,  the 
comparator  module  assigns  the  appropriate  bit  to  the  template. 
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Chapter  5 
Results 


5. 1  Binary  Morphology  Hardware  Implementation  Performance 

The  functions  for  computing  the  binary  dilation  and  erosion  are  isolated  and  the  start  and 
end  times  are  recorded  to  compute  the  time  in  milliseconds  the  functions  took  to  execute.  The 
data  is  tabulated  in  Table  5.1.  The  images  tested  are  standard  640  by  480  VGA  images  from  the 
BATH  database. 


Table  5.1:  Timing  comparisons  of  software  and  hardware  implementations.  1000  iterations 
of  the  functions  comprise  these  timing  characteristics. 


Software  Functions 

Hardware  Functions 

Mean  1.341s 

Mean  6.144ms 

Dilation  611.7ms 

Erosion  729.3ms 

Standard  Deviation 

Standard  Deviation 

72.476ms 

313us 

Range 

Range 

1.090s  to  1.470s 

5.943ms  to  6.846ms 

5.2  FIR  Filtering  Hardware  Implementation  Performance 

The  FIR  filtering  hardware  successfully  eliminated  the  filter  kernel  size  impact  on 
execution  time  of  the  template  generation  process  in  the  algorithm.  As  stated  before,  the  filter 
kernel’s  computations  are  loaded,  accumulated  and  saved  in  parallel.  Thus,  the  FIR  filtering  is 
dropped  from  a  fourth  order  growth  function  of  four  nested  loops  to  only  second  order.  With  the 
hardware  FIR  filtering,  only  the  image  size  affects  total  processing  time  and  each  pixel  only 
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requires  between  two  and  three  cycles  to  process.  The  resulting  timing  analysis  is  shown  in 
Figure  5.1. 


Figure  5.1:  FPGA  vs  GPP  on  VGA  input  and  differing  kernel  sizes 


Figure  5.1  illustrates  that  the  sequential  C  code  filtering  is  in  fact  two  orders  of 
magnitude  greater  growth  function  than  the  parallel  FPGA  hardware  implementation.  Also,  since 
the  execution  time  on  the  FPGA  is  in  the  microseconds  and  the  graph  is  time  vs.  the  filter 
dimension  (a  2nd  order  variable),  the  FPGA  timing  appears  as  a  flat  line  across  the  bottom  of  the 
graph.  This  evaluation  is  independent  of  the  actual  process  performed  by  the  RED  algorithm 
which  uses  at  least  16  rows  by  128  columns  to  generate  the  2048  bit  template. 
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Chapter  6 
Conclusion 


Application  specific  hardware  design  can  be  a  complex  process  that  may  or  may  not  be 
worth  the  cost  of  design  for  just  small  enhancements  over  software.  Theoretically,  an  algorithm’s 
execution  could  be  nearly  totally  implemented  with  hardware  functions  and  the  processor  would 
merely  need  to  move  data  from  one  peripheral  hardware  device  to  another,  assuming  sufficient 
space  on  the  programmable  logic  device  or  integrated  circuit.  This  design  is  certainly  not  the 
only  part  of  the  RED  iris  recognition  algorithm  that  can  be  implemented  using  hardware 
functions.  However,  some  things  are  simply  easier  to  do  in  software  even  though  they  may  be 
slower. 


6. 1  Binary  Morphology 

The  hardware  function  outperforms  the  software  in  speed  of  execution.  The  software 
timing  shows  that  the  binary  morphology  dilation  and  erosion  functions  together  take 
approximately  218  times  longer  to  execute.  In  large  processing  applications  that  must  process 
many  images  at  a  time  this  can  be  a  significant  increase.  Furthermore,  the  increase  makes  the 
segmentation  of  the  pupil  from  the  image  nearly  negligible  and  increases  total  segmentation  time 
by  about  37%. 

There  are  tradeoffs,  however,  and  the  hardware  takes  up  additional  space  within  the 
FPGA  which  also  includes  the  microprocessor,  memory  controller  and  other  peripherals.  With 
today’s  synthesizing  software  that  effectively  optimizes  the  design  for  fit  and  power,  the  logic 
consumption  takes  up  less  than  1%  of  the  logic  elements  and  registers  in  the  Stratix  III 
EP3SL150F1 152C3ES  which  has  approximately  150,000  logic  elements  and  1  Mbit  of  on-chip 
memory.  Subsequently,  the  power  consumption  in  the  overall  system  is  reduced  since  the 
hardware  implementation  uses  substantially  less  logic  than  the  software  loops  used. 


6.2  FIR  Filtering 

The  parallelized  digital  filtering  has  many  advantages  over  the  sequential  and 
straightforward  approach  to  the  circular  convolution  filter.  The  parallel  filter  can  be  easily 
inserted  into  a  pipelined  dataflow  where  maximum  processing  for  a  single  byte  of  data  is  desired. 
By  modifying  the  convolution  process  slightly,  the  hardware  is  freed  of  most  overhead 
associated  with  the  sequential  circular  convolution.  Also,  the  memory  usage  is  only  slightly 
greater  in  the  parallelized  convolution  approach  than  the  basic  software  or  basic  hardware 
convolution  while  significantly  less  than  the  parallelized  memory  needed  to  handle  the  hefty 
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reading  requirements  of  front  loaded  digital  filtering  of  the  basic  hardware  convolution. 
Furthermore,  the  partial  approach  to  circular  convolution  presents  many  more  opportunities  for 
optimization  and  customization  since  the  data  flow  can  be  channeled  to  meet  the  needs  of  the 
application. 

Depending  on  the  size  of  the  energy  data  to  be  loaded,  the  time  needed  to  complete  the 
filtering  can  be  easily  calculated.  The  basic  formula  for  how  many  cycles  a  complete  convolution 
requires  is  shown  below  in  equation  7.1: 


t{x, y)  =2xy  -\-27 y  (7.1) 


Where  x  is  the  number  of  columns  and  y  is  the  number  of  rows  in  the  energy  data.  The  equation 
is  asymmetric  because  the  parallelism  is  oriented  with  respect  to  the  rows  (x  direction)  and  thus 
there  is  added  latency  when  the  filter  array  shifts  up  or  down  in  the  y  direction.  Also  note  that  the 
size  of  the  filter  kernel  has  no  effect  on  the  execution  time  and  thus  a  filter  as  large  as  infinity  by 
infinity  could  be  calculated  in  the  same  time  as  a  filter  with  only  a  single  coefficient  (lxl).  For 
the  RED  algorithm,  the  number  of  clock  cycles  needed  to  complete  the  template  filtering  is 
shown  in  equation  7.2  which  is  90.56  microseconds  at  a  50  MHz  clock  frequency.  An  iris  size  of 
128  radial  lines  and  16  concentric  circles  is  chosen  because  this  is  the  minimum  resolution  of  the 
iris  to  compute  the  2048  bit  template. 


r(l28,16)  =4528  cycles  (7.2) 


In  comparison,  simulated  tests  of  filtering  were  performed  on  a  general  purpose  processor 
running  at  4  GHz  and  multiple  cores.  This  machine  took  an  average  of  32  milliseconds  to 
compute  a  9x9  kernel  convolution  on  a  large  VGA  (640x480)  input.  The  same  convolution  was 
tested  on  the  parallel  filtering  array  and  took  only  1.3milliseconds.  This  is  illustrated  in  Figure 
5.1. 


6.3  Wrapping  Up 

The  final  system  could  never  fully  be  tested  because  of  the  design  issues  identified  the 
DE3  board.  Therefore,  the  most  interesting  comparison  of  how  the  overall  embedded  system 
performed  next  to  the  original  general  purpose  machine  could  not  be  evaluated  in  the  available 
time.  Yet,  the  embedded  system  may  one  day  prove  to  be  preferable  over  the  general  purpose 
system  for  many  reasons.  An  embedded  system  is  independent  of  operating  systems  and 
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environments  designed  for  unrelated  computing.  Thus,  this  research  developed  an  embedded 
system  unique  to  iris  recognition  which  is  a  feat  in  itself.  The  implementation  in  this  research  is  a 
proof-of-concept  to  show  that  an  algorithm,  iris  recognition  or  other  that  has  value  being 
portable,  can  be  redesigned  as  a  stand-alone  device.  Lower  power  operation  and  even  more 
efficient  and  faster  operation  than  anything  ever  before  are  likely  possible  given  enough  design 
resources,  money  and  time. 
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Appendix  A:  Embedded  System  Top  Level  VHDL  Code 

library  ieee; 

use  ieee . std_logic_1164 . all; 
library  altera; 

use  altera . altera_syn_attributes . all; 

entity  RED_System  is 
port 
( 

--  Buttons 

Button  :  in  std_logic_vector (3  downto  0) ; 

--  Clock  Out 

CLK_OUT  :  out  std_logic; 

—  DDR2 

mem_addr  :  out  std_logic_vector ( 15  downto  0) ; 
mem_ba  :  out  std_logic_vector (2  downto  0) ; 
memcasn  :  out  std_logic; 

mem_cke  :  out  std_logic_vector ( 1  downto  0) ; 
mem_clk_n  :  inout  std_logic_vector ( 1  downto  0) ; 
mem_clk  :  inout  std_logic_vector ( 1  downto  0) ; 
memcsn  :  out  std_logic_vector ( 1  downto  0) ; 
mem_dm  :  out  std_logic_vector (7  downto  0) ; 
mem_dq  :  inout  std_logic_vector ( 63  downto  0) ; 
mem_dqs  :  inout  std_logic_vector (7  downto  0) ; 
mem_dqsn  :  inout  std_logic_vector (7  downto  0) ; 
mem_odt  :  out  std_logic_vector ( 1  downto  0) ; 
memrasn  :  out  std_logic; 
memwen  :  out  std_logic; 

--  DIP  Switches 

DIP_SW  :  in  std_logic_vector (7  downto  0) ; 

--  Clocks 

EXT_CLK  :  in  std_logic; 

OSC1_50  :  in  std_logic; 

OSC2_50  :  in  std_logic; 

OSC_BA  :  in  std_logic; 

OSC_BB  :  in  std_logic; 

OSC_BC  :  in  std_logic; 

OSC_BD  :  in  std_logic; 

--  Seven  Segment 

HEX0  :  out  std_logic_vector (6  downto  0) ; 

HEX0_DP  :  out  std_logic; 

HEX1  :  out  std_logic_vector (6  downto  0) ; 

HEX1_DP  :  out  std_logic; 

—  JTAG 

JVC_CLK  :  out  std_logic; 

JVC_CS  :  out  std_logic; 

JVC_DATAIN  :  in  std_logic; 

JVC_DATAOUT  :  out  std_logic; 

—  LEDs 

LEDB  :  out  std_logic_vector (7  downto  0) ; 

LEDG  :  out  std_logic_vector  (7  downto  0) ; 

LEDR  :  out  std_logic_vector (7  downto  0) ; 

--  USB 

OTG_A  :  out  std_logic_vector (17  downto  1) ; 

OTG_CS_n  :  out  std_logic; 

OTG_D  :  inout  std_logic_vector (31  downto  0); 


OTG_DC_DACK  :  out  std_logic; 
OTG_DC_DREQ  :  in  std_logic; 

OTG_DC_IRQ  :  in  std_logic; 

OTG_HC_DACK  :  out  std_logic; 
OTG_HC_DREQ  :  in  std_logic; 

OTG_HC_IRQ  :  in  std_logic; 

OTG_OE_n  :  out  std_logic; 

OTG_RESET_n  :  out  std_logic; 

OTG_WE_n  :  out  std_logic; 

—  SD  Card 

SD_CLK  :  out  std_logic; 

SD_CMD  :  inout  std_logic; 

SD_DAT  :  inout  std_logic; 

SD_WPn  :  in  std_logic; 

--  Slide  Switches 

SW  :  in  std_logic_vector (3  downto  0) ; 

--  Temp  Sensor 
TEMP_CLK  :  out  std_logic; 

TEMP_DATA  :  inout  std_logic; 

TEMP_INTn  :  in  std_logic 


)  ; 

end  RED_System; 

architecture  ppl_type  of  RED_System  is 

signal  usb_addr  :  std_logic_vector (16  downto  0) ; 


10  Group  Voltage  Configuration  (Do  not  modify  it) 


component  I0V_A3V3_B1V8_C3V3_D3V3  is 
port  ( 


signal 

iCLK  : 

IN 

STD 

LOGIC; 

signal 

iRST  n  : 

IN 

STD 

LOGIC; 

signal 

iENABLE  : 

IN 

STD 

LOGIC; 

signal 

oREADY  : 

OUT 

STD 

_LOGIC; 

signal 

OERR 

OUT  STD 

LOGIC 

signal 

oERRCODE 

OUT  STD 

LOGIC 

signal 

O JVC_CLK 

OUT  STD 

LOGIC 

signal 

oJVC_CS  : 

OUT 

STD 

_LOGIC; 

signal 

O  JV  C_DAT AOUT 

OUT  STD 

LOGIC 

signal 

i JVC_DATAIN 

IN  STD  LOGIC 

)  ; 

end  component; 


component  RED_S0PC  is 
port  ( 

--  1)  global  signals: 

signal  altmemddr_aux_full_rate_clk_out  :  OUT  STD_L0GIC; 
signal  altmemddr_aux_half_rate_clk_out  :  OUT  STD_L0GIC; 
signal  altmemddr_phy_clk_out  :  OUT  STD_L0GIC; 
signal  elk  :  IN  STD_L0GIC; 
signal  resetn  :  IN  STD_L0GIC; 

--  the_ISP1761 

signal  A_from_the_ISP1761  :  OUT  STD_LOGIC_VECTOR  (17  DOWNTO  1); 
signal  CS_N_f rom_the_ISP1761  :  OUT  STD_LOGIC; 
signal  DC_DACK_f rom_the_ISP1761  :  OUT  STD_LOGIC; 
signal  DC_DREQ_to_the_ISP1761  :  IN  STD_LOGIC; 
signal  DC_IRQ_to_the_ISP1761  :  IN  STD_LOGIC; 

signal  D  to  and  from  the  ISP1761  :  INOUT  STD  LOGIC  VECTOR  (31  DOWNTO  0) 
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signal  HC_DACK_from_the_ISP1761  :  OUT  STD_LOGIC; 
signal  HC_DREQ_to_the_ISP1761  :  IN  STD_LOGIC; 
signal  HC_IRQ_to_the_ISP1761  :  IN  STD_LOGIC; 
signal  RD_N_from_the_ISP1761  :  OUT  STD_LOGIC; 
signal  WR_N_from_the_ISP1761  :  OUT  STD_LOGIC; 

--  the_SEG7 

signal  avs_sl_export_seg7_f rom_the_SEG7  :  OUT  STD_LOGIC_VECTOR  (15  DOWNTO  0) ; 

--  the  altmemddr 

signal  global_reset_n_to_the_altmemddr  :  IN  STD_LOGIC; 

signal  local_init_done_f rom_the_altmemddr  :  OUT  STD_LOGIC; 

signal  local_refresh_ack_f rom_the_altmemddr  :  OUT  STD_LOGIC; 

signal  local_wdata_req_f rom_the_altmemddr  :  OUT  STD_LOGIC; 

signal  mem_addr_f rom_the_altmemddr  :  OUT  STD_LOGIC_VECTOR  (12  DOWNTO  0) ; 

signal  mem_ba_f rom_the_altmemddr  :  OUT  STD_LOGIC_VECTOR  (1  DOWNTO  0) ; 

signal  mem_cas_n_f rom_the_altmemddr  :  OUT  STD_LOGIC; 

signal  mem_cke_f rom_the_altmemddr  :  OUT  STD_LOGIC; 

signal  mem_clk_n_to_and_f rom_the_altmemddr  :  INOUT  STD_LOGIC_VECTOR  (1  DOWNTO 

0); 

signal  mem_clk_to_and_from_the_altmemddr  :  INOUT  STD_LOGIC_VECTOR  (1  DOWNTO  0) ; 
signal  mem_cs_n_f rom_the_altmemddr  :  OUT  STD_LOGIC; 

signal  mem_dm_from_the_altmemddr  :  OUT  STD_LOGIC_VECTOR  (7  DOWNTO  0) ; 

signal  mem_dq_to_and_f rom_the_altmemddr  :  INOUT  STD_LOGIC_VECTOR  (63  DOWNTO  0) ; 

signal  mem_dqs_to_and_f rom_the_altmemddr  :  INOUT  STD_LOGIC_VECTOR  (7  DOWNTO  0) ; 

signal  mem_dqsn_to_and_f rom_the_altmemddr  :  INOUT  STD_LOGIC_VECTOR  (7  DOWNTO  0) ; 

signal  mem_odt_f rom_the_altmemddr  :  OUT  STD_LOGIC; 

signal  mem_ras_n_f rom_the_altmemddr  :  OUT  STD_LOGIC; 

signal  memwenf rom_the_altmemddr  :  OUT  STD_LOGIC; 

signal  oct_ctl_rs_value_to_the_altmemddr  :  IN  STD_LOGIC_VECTOR  (13  DOWNTO  0) ; 
signal  oct_ctl_rt_value_to_the_altmemddr  :  IN  STD_LOGIC_VECTOR  (13  DOWNTO  0) ; 
signal  reset_phy_clk_n_from_the_aitmemddr  :  OUT  STD_LOGIC; 

--  the_pio_button 

signal  in_port_to_the_pio_button  :  IN  STD_LOGIC_VECTOR  (3  DOWNTO  0) ; 

--  the_pio_led 

signal  out_port_f rom_the_pio_led  :  OUT  STD_LOGIC_VECTOR  (23  DOWNTO  0) ; 

--  the_usb_reset_n 

signal  out_port_f rom_the_usb_reset_n  :  OUT  STD_LOGIC 
tr¬ 
end  component; 

begin 

IOV_Inst :  I0V_A3V3_B1V8_C3V3_D3V3 
port  map  ( 

iCLK  =>  OSC2_50, 
iRST_n  =>  '  1 '  , 
iENABLE  =>  1 1 1 , 

— READY  =>  , 

— ERR  =>  , 

— ERRCODE  =>  , 

OjVC_CLK  =>  JVCJ3LK, 

OjVC_CS  =>  JVC_CS, 

OJVC_DATAOUT  =>  JVC_DATAOUT, 
iJVC_DATAIN  =>  JVC_DATAIN 

)  ; 


RED_inst :  RED_SOPC 
port  map 
( 


--  1)  global  signals: 
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--  altmemddr_aux_full_rate_clk_out  =>  <signal>,  --  Unused 

--  altmemddr_aux_half_rate_clk_out  =>  <signal>,  --  Unused 

--  altmemddr_phy_clk_out  =>  <signal>, 
elk  =>  OSC1_50 , 
reset_n  =>  '  1 '  , 


--  the  USB 

A_f rom_the_ISP17 61 ( 17  downto  1)  =>  OTG_A(17  downto  1) , 
cs_N_from_the_ISP1761  =>  OTG_CS_n, 

DC_DACK_f rom_the_ISP17 61  =>  OTG_DC_DACK, 
DC_DREQ_to_the_ISP17  61  =>  OTG_DC_DREQ , 
DC_IRQ_to_the_ISP1761  =>  OTG_DC_IRQ, 

D_to_and_f rom_the_ISP17 61  =>  OTG_D, 

HC_DACK_f rom_the_ISP17 61  =>  OTG_HC_DACK , 
HC_DREQ_to_the_ISP17  61  =>  OTG_HC_DREQ , 
HC_IRQ_to_the_ISP17  61  =>  OTG_HC_IRQ, 
RD_N_from_the_ISP1761  =>  OTG_OE_n, 
WR_N_from_the_ISP1761  =>  OTG_WE_n, 
out_port_f rom_the_usb_reset_n  =>  OTG_RESET_n, 

--  SEG7 

avs_sl_export_seg7_f rom_the_SEG7 (15)  =>  HEX1_DP, 
avs_sl_export_seg7_f rom_the_SEG7 (14  downto  8)  =>  HEX1, 
avs_sl_export_seg7_f rom_the_SEG7 (7)  =>  HEX0_DP, 
avs_sl_export_seg7_f rom_the_SEG7 (6  downto  0)  =>  HEXO, 


--  Buttons 

in_port_to_the_pio_button  =>  Button, 


—  LEDs 

out_port_f rom_the_pio_led (23  downto  16)  =>  LEDR, 
out_port_f rom_the_pio_led ( 15  downto  8)  =>  LEDG, 
out_port_from_the_pio_led (7  downto  0)  =>  LEDB, 


--  the_altmemddr 

global_reset_n_to_the_altmemddr  =>  ' 1 '  , 
mem_addr_f rom_the_altmemddr  =>  mem_addr(12  downto  0), 
mem_ba_f rom_the_altmemddr  =>  mem_ba(l  downto  0), 
mem_cas_n_f rom_the_altmemddr  =>  mem_cas_n, 
mem_cke_f rom_the_altmemddr  =>  mem_cke(0), 
mem_clk_n_to_and_f rom_the_altmemddr  =>  mem_clk_n, 
mem_clk_to_and_f rom_the_altmemddr  =>  mem_clk, 
mem_cs_n_f rom_the_altmemddr  =>  mem_cs_n(0), 
mem_dm_f rom_the_altmemddr  =>  mem_dm, 
mem_dq_to_and_f rom_the_altmemddr  =>  mem_dq, 
mem_dqs_to_and_f rom_the_altmemddr  =>  mem_dqs, 
mem_dqsn_to_and_f rom_the_altmemddr  =>  mem_dqsn, 
mem_odt_f rom_the_altmemddr  =>  mem_odt(0), 
mem_ras_n_f rom_the_altmemddr  =>  mem_ras_n, 
mem_we_n_f rom_the_altmemddr  =>  mem_we_n, 

oct_ctl_rs_value_to_the_altmemddr  =>  " 00000000000000" , 
oct  ctl  rt  value  to  the  altmemddr  =>  "00000000000000" 


)  ; 


—  Unused 


end; 


Appendix  B:  Binary  Morphology  VHDL  Code 


library  ieee; 

use  ieee . std_logic_1164 . all; 

entity  BM_Dilate_Erode  is 

generic 

( 

ROWS  :  natural  :=  4; 

COLS  :  natural  :=  4 

)  ; 

port 

( 

avs_sl_clk  :  in  std_logic; 

--avs_sl_reset_n  :  in  std_logic  :=  '1'; 
avs_sl_read  :  in  std_logic; 
avs_sl_write  :  in  std_logic; 
avs_sl_chipselect  :  in  std_logic; 
avs_sl_address  :  in  std_logic; 

avs_sl_readdata  :  out  std_logic_vector (7  downto  0) ; 
avs_sl_writedata  :  in  std_logic_vector (7  downto  0) 

)  ; 

end  BM_Dilate_Erode; 

architecture  rtl  of  BM_Dilate_Erode  is 

signal  loop_mux,  loop_data  :  std_logic  :=  'O'; 

signal  local_data  :  std_logic_vector (ROWS*COLS  downto  0) ; 

signal  local_interconnect  :  std_logic_vector (ROWS*COLS  downto  0) 


signal 

elk  e  : 

std  logic  : 

=  '  0  '  ; 

signal 

shift  load 

:  std 

logic 

:=  '0' 

signal 

done 

:  std 

logic 

:=  '0' 

signal 

operation 

:  std 

logic 

:=  '0' 

component  BM_Element  is 


port  ( 

—  synchronous 
elk 

clk_enable 

--  asynchronous 

--reset 

aload 

adatain 

adataout 

--  input 

--toplef t 

topmid 

--topright 

midleft 

midright 

--bottomleft 

bottommid 

--bottomright 

operation 


:  in  std_logic; 

in  std_logic; 


in  std_logic; 
in  std_logic; 
in  std_logic; 
out  std_logic; 


:  in  std_logic; 

in  std_logic; 

in  std_logic; 

in  std_logic; 

in  std_logic; 

in  std_logic; 

in  std_logic; 

in  std_logic; 

in  std_logic; 


output 


output 


out 


std_logic 
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) ; 

end  component; 


begin 


--  Track  number  of  pixels 

--  Asynchronous  reset 


controller:  process (avs_sl_clk,  avs_sl_chipselect,  avs_sl_write) 
variable  dilate_erode  :  natural  :=  0;  --  30  Dilate/Erode 
variable  count  :  natural  :=  0; 
loaded  and  unloaded 
begin 

--if (avs_sl_reset_n  =  '0')  then 
--dilate_erode  :=  0; 

--count  :=  0; 

--elsif (rising_edge (avs_sl_clk) )  then 
if (rising_edge (avs_sl_clk) )  then 

if (avs_sl_chipselect  =  *  1  *  and  avs_sl_write  =  '1')  then 

are  being  written  to 

if (count  <  ROWS*COLS)  then 

--  Only  allow  rows*cols  number  of  pixels  in 
local_data (0)  <=  avs_sl_writedata (0) ; 

--  Place  write  data  into  first  shift  BM_Element 
shift_load  <=  '1'; 

--  Shift  loading  =  TRUE 
count  :=  count  +  1; 

--  Increment  pixels  count 

else 

--  Written  to? 
count  :=  ROWS* COLS ; 

--  Error:  Hold  at  same  number  of  pixels  in  image 
shift_load  <=  'O'; 

--  Do  not  shift 

end  if; 

elsif (avs_sl_chipselect  =  ’ 1 '  and  avs_sl_read  =  ' 1 ' )  then 


being  read  from 


--  We  are 


if (avs_sl_address  =  '0')  then 
--  Reading. . 

if (count  >  0  and  dilate_erode  =  30)  then 

--  Still  data? 

avs_sl_readdata (0)  <=  local_data (ROWS*COLS) ; 

--  Place  data  out 

shift_load  <=  1 1 ' ; 

--  Shift  data  out 
count  :=  count  -  1; 

--  Decremenet  pixels  count 


else 


avs_sl_readdata (0)  <=  'O'; 
--  Don't  output 
shift_load  <=  'O'; 

--  Don't  shift 
count  :=  0; 

--  Keep  count  zero 

end  if; 

elsif (avs_sl_address  =  '1')  then 
--  Polling. . . 

if (dilate_erode  =  30)  then 
--  Done? 

avs_sl_readdata (0)  <=  ' 1 ' ; 
—  Yes 

else 

avs_sl_readdata (0)  <=  'O'; 
—  No 

end  if; 

end  if; 


We 


end  if; 
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if (dilate_erode  >  14)  then 
operation  <=  1 1 1 ; 

else 


operation  <=  'O'; 

end  if; 


if  (count  =  ROWS*COLS  and  dilate_erode  <  30)  then 

Processing 

clk_e  <=  ' 1 ' ; 

dilate_erode  :=  dilate_erode  t  1; 
elsif (count  =  ROWS*COLS  and  dilate_erode  =  30)  then 
clk_e  <=  'O'; 
dilate_erode  :=  30; 

end  if; 


end  if; 
end  process; 


BM_rows :  for  j  in  0  to  ROWS  -  1  generate 
begin 


BM_cols:  for  i  in  0  to  COLS  -  1  generate 
begin 

--loop_mux  <=  local_data ( j *COLS  +  i)  when  i+j  >  0  else  loop_data; 
U1 :  BM  Element 


port  map ( 

elk  =>  avs_sl_clk, 
clk_enable  =>  clk_e, 
operation  =>  operation, 

--  asynchronous 

--reset  =>  avs_sl_reset_n, 

aload  =>  shift_load, 

adatain  =>  local_data ( j *COLS  +  i)  , 

adataout  =>  local_data ( j *COLS  +  i  t  1), 


(  (i  t  COLS  -  1)  mod  COLS)  ), 
+  COLS)  mod  COLS)  ), 

(  (i  t  COLS  t  1)  mod  COLS)  ), 


--  top 

--topleft  =>  local_interconnect (  ( (j  +  ROWS  -  1)  mod  ROWS)*COLS  + 
topmid  =>  local_interconnect  (  ( (j  +  ROWS  -  1)  mod  ROWS)*COLS  +  ( (i 
--topright  =>  local_interconnect (  ( (j  t  ROWS  -  1)  mod  ROWS)*COLS  + 


COLS)  ), 
COLS)  ), 


--  mid 

midleft  =>  local_interconnect (  (j*COLS)  +  ( (it (COLS-1) )  mod 
midright  =>  local_interconnect (  (j*COLS)  +  ( (it (COLStl) )  mod 


t  ( (i  t  COLS  -  1)  mod  COLS) 
( (i  t  COLS)  mod  COLS)  ) , 
t  ( (i  t  COLS  t  1)  mod  COLS) 


--  bottom 

--bottomleft  =>  local_interconnect (  ( (j  t  ROWS  t  1)  mod  ROWS)*COLS 

)  , 

bottommid  =>  local_interconnect (  ( (j  t  ROWS  t  1)  mod  ROWS)*COLS  t 
--bottomright  =>  local_interconnect (  ( (j  t  ROWS  t  1)  mod  ROWS)*COLS 

)  , 


output 
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element 


output  =>  local_interconnect (  j*COLS  +  i)  --  point  to  current 


)  ; 

end  generate; 


end  generate; 


--outmemory  <=  local_interconnect; 

end  rtl; 


library  ieee; 

use  ieee . std_logic_1164 . all; 
entity  BM_Element  is 


port  ( 

--  synchronous 
elk 

clk_enable 

--  asynchronous 

--reset 

aload 

adatain 

adataout 

--  input 

--toplef t 

topmid 

--topright 

midleft 

midright 

--bottomleft 

bottommid 

--bottomright 

operation 

--  output 
output 


end  entity; 


in 

in 

in 

out 


in 

in 

in 

in 

in 

in 

in 

in 


out 


architecture  rtl  of  BM  Element  is 


signal  pixel,  pixel_step 
—  signal  pixel_step 

begin 


in  std_logic; 

std_logic; 


std_logic; 

std_logic; 

std_logic; 

std_logic; 


in  std_logic; 

std_logic; 

std_logic; 

std_logic; 

std_logic; 

std_logic; 

std_logic; 

std_logic; 

std_logic; 


std_logic 


std_logic; 

std_logic; 


process (elk,  pixel,  pixel_step) 
begin 

pixel  <=  pixel_step; 
end  process; 

--  In  Altera  devices,  register  signals  have  a  set  priority. 

--  The  HDL  design  should  reflect  this  priority. 

process (aload,  adatain,  elk) 

begin 

--  The  asynchronous  reset  signal  has  the  highest  priority 
--if  (reset  =  '0')  then 

--pixel_step  <=  'O'; 

--  Asynchronous  load  has  next-highest  priority 
if  (aload  =  '1')  then 

adataout  <=  pixel; 
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pixel_step  <=  adatain; 

else 

--  At  a  clock  edge,  if  asynchronous  signals  have  not  taken  priority, 

--  respond  to  the  appropriate  synchronous  signal. 

--  Check  for  synchronous  reset,  then  synchronous  load. 

--  If  none  of  these  takes  precedence,  update  the  register  output 
--  to  be  the  register  input, 
if  (rising_edge (elk) )  then 

if (clk_enable  =  '1')  then 

if (operation  =  '0')  then  --  Dilate 

--pixel_step  <=  topleft  or  topmid  or  topright  or  midleft  or 
midright  or  bottomleft  or  bottommid  or  bottomright; 

pixel_step  <=topmid  or  midleft  or  midright  or  bottommid; 

else 


pixel_step  <=  not (topmid  or  midleft  or  midright  or 


bottommid) ; 


end  if; 


end  if; 


end  if; 


end  if; 
end  process; 


process (elk) 
begin 

if (rising_edge (elk) )  then 

if (operation  =  '0')  then 
output  <=  pixel; 

else 


output  <=  not (pixel); 

end  if; 

end  if; 
end  process; 


end  rtl; 


56 


Appendix  C:  FIR  Filtering  VHDL  Code 

library  ieee; 

use  ieee . std_logic_1164 . all; 

USE  ieee . std_logic_unsigned . ALL; 
use  ieee . numeric_std . all ; 

entity  TemplateTest  is 
generic 
( 

ADDRESSJ/JIDTH  :  natural  :=  8; 

DATA_WIDTH  :  natural  :=  8 

)  ; 


port 

( 

--  Input  ports 

clock  :  in  std_logic; 

go_input  :  in  std_logic  :=  'O'; 

reset_in  :  in  std_logic; 

data_input  :  in  std_logic_vector ( (DATA_WIDTH-1 )  downto  0) ; 


--  Output  ports 

wide_data_out  :  out  std_logic_vector ( (9*2*DATA_WIDTH-1)  downto  0)  ; 
wide_wr_out  :  out  std_logic_vector ( 8  downto  0) 

)  ; 

end  TemplateTest; 


architecture  rtl  of  TemplateTest  is 
--  16  bit  memory 

signal  a_data  :  std_logic_vector ( (DATA_WIDTH-1)  downto  0) ; 
signal  b_data  :  std_logic_vector ( (DATA_WIDTH-1)  downto  0) ; 
signal  data_write_bus  :  std_logic_vector ( (9*2*DATA_WIDTH-1)  downto  0) ; 
signal  data_read_bus  :  std_logic_vector ( (9*2*DATA_WIDTH-1)  downto  0) ; 


signal  wr  :  std_logic_vector ( 8  downto  0)  :=  "000000000"; 

signal  sel_col  :  natural  :=  0; 

signal  addr_write_bus  :  std_logic_vector ( (9*ADDRESS_WIDTH-1)  downto  0) ; 
signal  addr_read_bus  :  std_logic_vector ( (9*ADDRESS_WIDTH-1)  downto  0) ; 
signal  square_data  :  std_logic_vector ( (2*DATA_WIDTH-1)  downto  0) ; 


component  TemplateElement 

generic 

( 

DATA_WIDTH  :  natural  :=  8; 
ADDRESS_WIDTH  :  natural  :=  8; 
INITIAL_COL  :  natural; 
INITIAL_ROW  :  natural; 
ARRAY_COL  :  natural; 

ARRAY_ROW  :  natural 

)  ; 


port 

( 
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go 

coef  f 

:  in 

:  in  std 
std  logic  : 

logic 
=  'O'; 

new  row  in 

:  in 

std  logic  : 

=  'O'; 

reset 

:  in 

std  logic  : 

=  'O'; 

elk 

:  in  std 

logic 

--  16  bit  memory 

a  :  in  std_logic_vector  ( (DATA_WIDTH-1 )  downto  0); 

b  :  in  std_logic_vector  ( (DATA_WIDTH-1 )  downto  0); 

data_in  :  in  std_logic_vector  ( (2*DATA_WIDTH-1 )  downto  0) ; 

data_out  :  out  std_logic_vector  ( (2*DATA_WIDTH-1 )  downto  0) ; 


read_write  :  out  std_logic_vector ( 0  downto  0) ; 

addr_in  :  out  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 

addr_out  :  out  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) 


end  component; 

component  RowBankMemory  is 

generic 

( 

DATA_WIDTH  :  natural  :=  8; 

ADDR_WIDTH  :  natural  :=  8 

) ; 

port 

( 

elk  :  in  std_logic; 

raddr  :  in  natural  range  0  to  2  *  * ADDR_WI DTH  -  1; 
waddr  :  in  natural  range  0  to  2**ADDR_WIDTH  -  1; 

--  16  bit  memory 

data  :  in  std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0); 
q  :  out  std_logic_vector ( (2*DATA_WIDTH  -1)  downto  0) ; 


we  :  in  std_logic_vector ( 0  downto  0)  :=  "0 


end  component; 

component  RowMUX  is 
generic 
( 

DATA_WIDTH  :  natural  :=  8; 

ADDRESS  WIDTH  :  natural  :=  8 


port 

( 

--  Input  ports 
data_write_busO  :  in 

data_write_busl  :  in 

data_write_bus2  :  in 

data_write_bus3  :  in 

data_write_bus4  :  in 

data_write_bus5  :  in 

data_write_bus6  :  in 

data_write_bus7  :  in 

data  write  bus8  :  in 


std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 
std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 
std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 
std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 
std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 
std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 
std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 
std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 
std  logic  vector ( (2*DATA  WIDTH-1)  downto  0) ; 


addr_write_busO  :  in  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 
addr_write_busl  :  in  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 
addr_write_bus2  :  in  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 
addr_write_bus3  :  in  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 
addr_write_bus4  :  in  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 
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addr_write_bus5  :  in  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 
addr_write_bus6  :  in  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 
addr_write_bus7  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 
addr_write_bus8  :  in  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 

addr_read_busO  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

addr_read_busl  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

addr_read_bus2  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

addr_read_bus3  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

addr_read_bus4  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

addr_read_bus5  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

addr_read_bus6  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

addr_read_bus7  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

addr_read_bus8  :  in  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

wrO  :  in  std_logic_vector  ( 0  downto  0) ; 
wrl  :  in  std_logic_vector ( 0  downto  0) ; 
wr2  :  in  std_logic_vector ( 0  downto  0) ; 
wr3  :  in  std_logic_vector ( 0  downto  0) ; 
wr4  :  in  std_logic_vector ( 0  downto  0)  ; 
wr5  :  in  std_logic_vector ( 0  downto  0) ; 
wr6  :  in  std_logic_vector ( 0  downto  0) ; 
wr7  :  in  std_logic_vector ( 0  downto  0)  ; 
wr8  :  in  std_logic_vector ( 0  downto  0) ; 

col_sel  :  natural; 

--  Output  ports 

data_write_bus_out  :  out  std_logic_vector ( (2*DATA_WIDTH-1)  downto  0) ; 

addr_write_bus_out  :  out  std_logic_vector ( (ADDRESS_WIDTH-1 )  downto  0) ; 

addr_read_bus_out  :  out  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) ; 

wr_out  :  out  std_logic_vector ( 0  downto  0) 

)  ; 

end  component; 

component  FilterRowController  is 

generic 

( 

ROW_INIT  :  natural  :=  0 

)  ; 

port 

( 

--  Input  ports 
go 

--  reset 
elk 

--  Output  ports 
go_cmd  :  out  std_logic; 

oper  :  out  std_logic_vector ( 8  downto  0) ; 

--  new_col_out  :  out  std_logic_vector ( 8  downto  0) ; 
new_row_out  :  out  std_logic_vector ( 8  downto  0) ; 

col_sel  :  out  std_logic_vector (3  downto  0) 

--  col_count_out  :  out  std_logic_vector (7  downto  0) 

)  ; 

end  component; 


begin 

process (data_input,  square_data,  go_input) 
begin 

—  square_data  <=  data_input*data_input; 


in  std_logic; 
in  std_logic; 
in  std  logic; 


16  bit  memory 
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--  Coefficient  of  1 

a_data  <=  data_input ( (DATA_WIDTH-1 )  downto  0); 

--  Coefficient  of  -0.5 

b_data (DATA_WIDTH-1 )  <=  'O';  —  not (square_data (2*DATA_WIDTH-1) ) ; 
b_data ( (DATA_WIDTH-2 )  downto  0)  <=  data_input ( (DATA_WIDTH-1 )  downto  1);  -- 
not (square_data ( (2*DATA_WIDTH-1 )  downto  DATA_WIDTH+1 ) ) +1; 


end  process; 


filter_rows:  for  j  in  0  to  0  generate  --  12  28 


--  Local  Row  Signals 
signal  local_wr 
signal  local_wr_data 
signal  local_wr_addr 
signal  local_rd_addr 
signal  col_control 
signal  row_control 
signal  oper_control 
signal  go_control 
signal  col_selects 


:  std_logic_vector (8  downto  0)  ; 
std_logic_vector ( ( 9*2*DATA_WIDTH-1 )  downto  0); 
std_logic_vector ( ( 9*ADDRESS_WIDTH-1 )  downto  0) ; 
std_logic_vector ( ( 9*ADDRESS_WIDTH-1 )  downto  0) ; 
std_logic_vector (8  downto  0) ; 
std_logic_vector (8  downto  0) ; 
std_logic_vector (8  downto  0)  ; 
std_logic; 

std_logic_vector (3  downto  0) ; 


begin 


filter_cols:  for  i  in  0  to  8  generate  --  125  252 


begin 

U1 :  TemplateElement2 
generic  map ( 

INITIAL_COL  =>  ( (i  +  124)  mod  128), 
I N I T I AL_RO W  =>  ( ( j  +  12)  mod  16), 
ARRAY_COL  =>  i, 

ARRAY_ROW  =>  j 


port  map ( 

go  =>  go_control, 
coeff  =>  oper_control (i)  , 

--  new_col  =>  col_control (i) , 
new_row_in  =>  row_control (i) , 
elk  =>  clock, 
a  =>  a_data, 
b  =>  b_data, 
reset  =>  reset_in, 

data_in  =>  data_read_bus ( ( j *16  +  (2*DATA_WIDTH-1 ) )  downto  j*16), 
read_write  =>  local_wr(i  downto  i) , 

data_out  =>  local_wr_data ( ( (i+1) *2*DATA_WIDTH-1)  downto 
addr_out  =>  local_wr_addr ( ( (i+1) *ADDRESS_WIDTH-1 )  downto 
addr_in  =>  local_rd_addr ( ( (i+1 ) *ADDRESS_WIDTH-1 )  downto 


(i*2*DATA_WIDTH) )  , 

( i*ADDRESS_WIDTH)  )  , 
( i*ADDRESS_WIDTH)  ) 


wr(j  downto  j)  <=  local_wr  when  sel_col  =  i  else  "Z"; 

data_write_bus ( ( j *2*DATA_WIDTH  +  (2*DATA_WIDTH-1) )  downto  j *2*DATA_WIDTH) 
<=  local_wr_data  when  sel_col  =  i  else  "ZZZZZZZZZZZZZZZZ" ; 

addr_write_bus ( ( j *ADDRESS_WIDTH  +  (ADDRESS_WIDTH-1) )  downto 
( j *ADDRESS_WIDTH) )  <=  local_wr_addr  when  sel_col  =  i  else  "ZZZZZZZZ"; 

addr_read_bus ( ( j *ADDRESS_WIDTH  +  (ADDRESS_WIDTH-1 ) )  downto 
( j *ADDRESS_WIDTH) )  <=  local_rd_addr  when  sel_col  =  i  else  "ZZZZZZZZ"; 
end  generate; 
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U4 :  VertFilterRowController 
generic  map ( 

R0W_INIT  =>  j 

) 

port  map ( 

go  =>  go_input, 

--  reset  =>  reset_in, 
elk  =>  clock. 


--  Output  ports 
go_cmd  =>  go_control, 
oper  =>  oper_control, 

--  new_col_out  =>  col_control, 
newrowout  =>  row_control, 
col  sel  =>  col  selects 


U3 :  RowMUX 


port 

(0*2*DATA_WIDTH)  )  , 
(1*2*DATA_WIDTH)  )  , 
(2*2*DATA_WIDTH)  )  , 
(3*2*DATA_WIDTH)  )  , 
(4*2*DATA_WIDTH)  )  , 
(5*2*DATA_WIDTH)  )  , 
(6*2*DATA_WIDTH)  )  , 
(7*2*DATA_WIDTH)  )  , 
(8*2*DATA_WIDTH)  )  , 


map  ( 

data_write_busO 
data_write_busl 
data_write_bus2 
data_write_bus3 
data_write_bus4 
data_write_bus5 
data_write_bus6 
data_write_bus7 
data  write  bus8 


=>  local_wr_data ( (1*2*DATA_WIDTH-1)  downto 
=>  local_wr_data ( (2*2*DATA_WIDTH-1)  downto 
=>  local_wr_data ( (3*2*DATA_WIDTH-1)  downto 
=>  local_wr_data ( (4*2*DATA_WIDTH-1)  downto 
=>  local_wr_data ( (5*2*DATA_WIDTH-1)  downto 
=>  local_wr_data ( (6*2*DATA_WIDTH-1)  downto 
=>  local_wr_data ( (7*2*DATA_WIDTH-1)  downto 
=>  local_wr_data ( (8*2*DATA_WIDTH-1)  downto 
=>  local_wr_data ( (9*2*DATA_WIDTH-1)  downto 


(0*ADDRESS_WIDTH) ) , 
(1*ADDRESS_WIDTH) ) , 
(2*ADDRESS_WIDTH) ) , 
(3*ADDRESS_WIDTH) ) , 
(4*ADDRESS_WIDTH) ) , 
(5*ADDRESS_WIDTH) ) , 
(6*ADDRESS_WIDTH) ) , 
(7*ADDRESS_WIDTH) ) , 
(8*ADDRESS_WIDTH) ) , 


addr_write_busO  =>  local_wr_addr ( (1*ADDRESS_WIDTH-1)  downto 
addr_write_busl  =>  local_wr_addr ( (2*ADDRESS_WIDTH-1 )  downto 
addr_write_bus2  =>  local_wr_addr ( (3*ADDRESS_WIDTH-1)  downto 
addr_write_bus3  =>  local_wr_addr ( (4*ADDRESS_WIDTH-1)  downto 
addr_write_bus4  =>  local_wr_addr ( (5*ADDRESS_WIDTH-1 )  downto 
addr_write_bus5  =>  local_wr_addr ( (6*ADDRESS_WIDTH-1)  downto 
addr_write_bus6  =>  local_wr_addr ( (7*ADDRESS_WIDTH-1)  downto 
addr_write_bus7  =>  local_wr_addr ( ( 8*ADDRESS_WIDTH-1 )  downto 
addr_write_bus8  =>  local_wr_addr ( (9*ADDRESS_WIDTH-1)  downto 


(0*ADDRESS_WIDTH) ) , 
(1*ADDRESS_WIDTH) ) , 
(2*ADDRESS_WIDTH) ) , 
(3*ADDRESS_WIDTH) ) , 
(4*ADDRESS_WIDTH) ) , 


addr_read_busO  =>  local_rd_addr ( (1*ADDRESS_WIDTH-1)  downto 
addr_read_busl  =>  local_rd_addr ( (2*ADDRESS_WIDTH-1)  downto 
addr_read_bus2  =>  local_rd_addr ( (3*ADDRESS_WIDTH-1)  downto 
addr_read_bus3  =>  local_rd_addr ( (4*ADDRESS_WIDTH-1)  downto 
addr_read_bus4  =>  local_rd_addr ( (5*ADDRESS_WIDTH-1)  downto 
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addr 

read  bus5  => 

local 

rd  addr I 

:  (6*ADDRESS 

WIDTH-1) 

downto 

(5*ADDRESS 

WIDTH) ) , 

addr 

read  bus6  => 

local 

rd  addr  I 

:  (7*ADDRESS 

WIDTH-1) 

downto 

(6*ADDRESS 

WIDTH) ) , 

addr 

read  bus7  => 

local 

rd  addr I 

:  (8*ADDRESS 

WIDTH-1) 

downto 

(7*ADDRESS 

WIDTH) ) , 

addr 

read  bus8  => 

local 

rd  addr  I 

:  (9*ADDRESS 

WIDTH-1) 

downto 

(8*ADDRESS 

WIDTH) ) , 

wrO 

=>  local  wr (0 

downto 

0)  , 

wrl 

=>  local  wr (1 

downto 

1)  , 

wr2 

=>  local  wr(2 

downto 

2)  , 

wr3 

=>  local  wr (3 

downto 

3)  , 

wr4 

=>  local  wr ( 4 

downto 

4)  , 

wr5 

=>  local  wr (5 

downto 

5)  , 

wr6 

=>  local  wr ( 6 

downto 

6)  , 

wr7 

=>  local  wr (7 

downto 

7)  , 

wr8 

=>  local  wr ( 8 

downto 

8)  , 

col_sel  =>  conv_integer (col_selects)  , 


--  Output  ports 

data_write_bus_out  =>  data_writeJous ( ( j *2*DATA_WIDTH  +  (2*DATA_WIDTH- 
1))  downto  j *2*DATA_WIDTH) , 

addr_write_bus_out  =>  addr_writeJous  (  ( j  *ADDRESS_WIDTH  t  (ADDRESSJWIDTH- 
1))  downto  ( j *ADDRESS_WIDTH) ) , 

addr_read_bus_out  =>  addr_read_bus ( ( j *ADDRESS_WIDTH  +  (ADDRESSJWIDTH- 

1))  downto  ( j *ADDRESS_WIDTH) ) , 

wrout  =>  wr ( j  downto  j) 

)  ; 

02 :  RowBankMemory 
port  map ( 

elk  =>  clock, 

raddr  =>  conv_integer (addr_read_bus ( ( j *ADDRESS_WIDTH  +  (ADDRESSJWIDTH-l) ) 
downto  ( j  *ADDRESS_WIDTH)  )  )  , 

waddr  =>  conv_integer (addr_write_bus ( ( j *ADDRESS_WIDTH  +  (ADDRESS_WIDTH-1) ) 
downto  ( j  *ADDRESS_WIDTH)  )  )  , 

data  =>  data_write_bus ( ( j *2*DATA_WIDTH  +  (2*DATA_WIDTH-1) )  downto 

j  *2*DATA_WIDTH)  , 

we  =>  wr ( j  downto  j), 

q  =>  data_readJous ( ( j *2*DATA_WIDTH  +  (2*DATA_WIDTH-1) )  downto 

j  *2*DATA_WIDTH) 

)  ; 

wide_addr_write_out  <=  local_wr_addr; 
wide_addr_read_out  <=  local_rd_addr ; 
wide_data_out  <=  local_wr_data; 
wide_wr_out  <=  local_wr; 

go_control_out  <=  go_control; 
col_selects_out  <=  col_selects; 


end  generate; 


end  rtl; 


library  ieee; 

use  ieee . std_logic_1164 . all; 
use  ieee . numeric_std . all; 


entity  RowBankMemory  is 
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generic 

( 

DATAJWIDTH  :  natural  :=  8; 

ADDR_WIDTH  :  natural  :=  8 

)  ; 

port 

( 

elk  :  in  std_logic; 

raddr  :  in  natural  range  0  to  2**ADDR_WIDTH  -  1; 
waddr  :  in  natural  range  0  to  2**ADDR_WIDTH  -  1; 

--  16  bit  memory 

data  :  in  std_logic_vector ( (2*DATA_WIDTH-1)  downto  0) ; 
q  :  out  std_logic_vector ( (2*DATA_WIDTH-1)  downto 

we  :  in  std_logic  :=  '1' 

)  ; 

end  RowBankMemory; 

architecture  rtl  of  RowBankMemory  is 

--  Build  a  2-D  array  type  for  the  RAM 
--  16  bit  memory 

subtype  word_t  is  std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 

type  memory_t  is  array (raddr ' high  downto  0)  of  word_t; 

--  Declare  the  RAM  signal, 
signal  ram  :  memory_t; 


begin 


process (elk) 
begin 

if  (rising_edge (elk) )  then 
if (we  =  ' 1 1 )  then 

ram (waddr)  <=  data; 

end  if; 

--  On  a  read  during  a  write  to  the  same  address,  the  read  will 
--  return  the  OLD  data  at  the  address 
q  <=  ram (raddr); 

end  if; 
end  process; 

end  rtl; 


library  ieee; 

use  ieee . std_logic_1164 . all; 

entity  RowMUX  is 
generic 
( 

DATA_WIDTH  :  natural  :=  8; 

ADDRESSJ/JIDTH  :  natural  :=  8 

)  ; 


port 

( 

--  Input  ports 

data  write  busO  :  in  std  logic  vector ( (2*DATA  WIDTH-1) 


downto  0) ; 


data 

write 

busl 

in 

std 

logic 

vector ( ( 2  *  DATA 

WIDTH-1) 

downto 

0) 

data 

write 

bus2 

in 

std 

logic 

vector ( ( 2  *  DATA 

WIDTH-1) 

downto 

0) 

data 

write 

bus3 

in 

std 

logic 

vector ( ( 2  *  DATA 

WIDTH-1) 

downto 

0 ) 

data 

write 

bus4 

in 

std 

logic 

vector ( ( 2  *  DATA 

WIDTH-1) 

downto 

0) 

data 

write 

bus5 

in 

std 

logic 

vector ( ( 2  *  DATA 

WIDTH-1) 

downto 

0 ) 

data 

write 

bus  6 

in 

std 

logic 

vector ( ( 2  *  DATA 

WIDTH-1) 

downto 

0) 

data 

write 

bus7 

in 

std 

logic 

vector ( ( 2  *  DATA 

WIDTH-1) 

downto 

0 ) 

data 

write 

bus8 

in 

std 

logic 

vector ( ( 2  *  DATA 

WIDTH-1) 

downto 

0)  ; 

addr_writeJcusO 
addr_write_busl 
addr_write_bus2 
addr_write_bus3 
addr_writejcus4 
addr_write_bus5 
addr_write_bus6 
addr_write_bus7 
addr  write  bus8 


in  std_logic_vector  (  (ADDRESSJ/JIDTH-1)  downto  0) ; 
in  std_logic_vector  (  (ADDRESSJ/JIDTH-1 )  downto  0) ; 
in  std_logic_vector  (  (ADDRESSJ/JIDTH-1)  downto  0)  ; 
in  std_logic_vector  (  (ADDRESSJ/JIDTH-1)  downto  0) ; 
in  std_logic_vector  (  (ADDRESSJ/JIDTH-1)  downto  0) ; 
in  std_logic_vector  (  (ADDRESSJ/JIDTH-1)  downto  0)  ; 
in  std_logic_vector  (  (ADDRESSJ/JIDTH-1)  downto  0) ; 
in  std_logic_vector  (  (ADDRESSJ/JIDTH-1)  downto  0) ; 
in  std_logic_vector  (  (ADDRESSJ/JIDTH-1)  downto  0)  ; 


addr_read_bus0  :  in 
addr_read_busl  :  in 
addr_read_bus2  :  in 
addr_read_bus3  :  in 
addr_read_bus4  :  in 
addr_read_bus5  :  in 
addr_read_bus6  :  in 
addr_read_bus7  :  in 
addr  read  bus8  :  in 


std_logic_vector 

std_logic_vector 

std_logic_vector 

std_logic_vector 

std_logic_vector 

std_logic_vector 

std_logic_vector 

std_logic_vector 

std_logic_vector 


(ADDRESSJ/JIDTH-1) 

(ADDRESS_WIDTH-1) 

(ADDRESS_WIDTH-1) 

(ADDRESS_WIDTH-1) 

(ADDRESS_WIDTH-1) 

(ADDRESSJ/JIDTH-1) 

(ADDRESS_WIDTH-1) 

(ADDRESS_WIDTH-1) 

(ADDRESS_WIDTH-1) 


downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

wrO 
wr  1 
wr2 
wr3 
wr4 
wr5 
wr6 
wr7 
wr8 


in  std_logic_vector  (0 
in  std_logic_vector  (0 
in  std_logic_vector (0 
in  std_logic_vector (0 
in  std_logic_vector (0 
in  std_logic_vector (0 
in  std_logic_vector (0 
in  std_logic_vector (0 
in  std_logic_vector (0 


downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

downto 

0) 

col  sel 


natural; 


--  Output  ports 

data_write_bus_out  :  out  std_logic_vector ( (2*DATA_WIDTH-1)  downto  0) ; 

addr_write_bus_out  :  out  std_logic_vector  (  (ADDRESSJ/JIDTH-1)  downto  0) 

addr_read_bus_out  :  out  std_logic_vector ( (ADDRESS_WIDTH-1)  downto  0) 

wr_out  :  out  std_logic_vector ( 0  downto  0) 

)  ; 

end  RowMUX; 


architecture  rtl  of  RowMUX  is 
begin 


SEL_PROCESS:  process  (col_sel,  data_write_bus0,  data_write_busl ,  data_write_bus2 
data_write__bus3,  data_write_bus4 , 

data_write_bus5,  data_write_bus6,  data_writeJous7, 
data_write_bus8,  addr_write_bus0,  addr_write_busl, 

addr_write_bus2 ,  addr_write_bus3 ,  addr_write_bus4 , 
addr_write_bus5,  addr_write_bus6,  addr_write_bus7, 

addr_writeJous8,  addr_read_bus0,  addr_read_busl, 
addr_read_bus2,  addr_read_bus3,  addr_read_bus4 , 

addr_readJous5,  addr_read_bus6,  addr_read_bus7, 
addr_read_bus8,  wrO,  wrl,  wr2,  wr3,  wr4,  wr5,  wr6,  wr7,  wr8) 
begin 


case  col_sel  is 

when  0  => 

data  write  bus  out  <=  data  write  busO; 


addr_write_bus_out  <=  addr_write_busO; 
addr_read_bus_out  <=  addr_read_busO; 
wrout  <=  wrO; 
when  1  => 

data_write_bus_out  <=  data_write_busl; 
addr_write_bus_out  <=  addr_write_busl ; 
addr_read_bus_out  <=  addr_read_busl; 
wr  out  <=  wrl; 
when  2  => 

data_write_bus_out  <=  data_write_bus2; 
addr_write_bus_out  <=  addr_write_bus2; 
addr_read_bus_out  <=  addr_read_bus2; 
wr  out  <=  wr2; 
when  3  => 

data_write_bus_out  <=  data_write_bus3; 
addr_write_bus_out  <=  addr_write_bus3; 
addr_read_bus_out  <=  addr_read_bus3; 
wr  out  <=  wr3; 
when  4  => 

data_write_bus_out  <=  data_write_bus4 ; 
addr_write_bus_out  <=  addr_write_Jous4  ; 
addr_read_bus_out  <=  addr_read_bus4 ; 
wrout  <=  wr4; 
when  5  => 

data_write_bus_out  <=  data_write_bus5; 
addr_write_bus_out  <=  addr_write_bus5; 
addr_read_bus_out  <=  addr_read_bus5; 
wrout  <=  wr5; 
when  6  => 

data_write_bus_out  <=  data_write_bus6; 
addr_write_bus_out  <=  addr_write_bus6; 
addr_read_bus_out  <=  addr_read_bus6; 
wr  out  <=  wr6; 
when  7  => 

data_write_bus_out  <=  data_write_bus7; 
addr_write_bus_out  <=  addr_write_bus7; 
addr_read_bus_out  <=  addr_read_bus7; 
wrout  <=  wr7; 
when  8  => 

data_write_bus_out  <=  data_write_bus8; 
addr_write_bus_out  <=  addr_write_bus8; 
addr_read_bus_out  <=  addr_read_bus8; 
wr  out  <=  wr8; 
when  others  => 

data_write_bus_out  <=  data_write_busO; 
addr_write_bus_out  <=  addr_write_busO; 
addr_read_bus_out  <=  addr_read_busO; 
wr  out  <=  wrO; 

end  case; 

end  process  SEL_PROCESS; 


end  rtl; 


library  ieee; 

use  ieee . std_logic_1164 . all; 

--  use  ieee. std_logic_arith. all; 
use  ieee. std_logic_signed. all; 
use  ieee. numeric  std.all; 


entity  TemplateElement  is 

generic 

( 

DATA_WIDTH  :  natural  :=  8; 
ADDRESS  WIDTH  :  natural  :=  8; 
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)  ; 


INITIAL_COL 
INITIAL_ROW 
ARRAY_COL 
ARRAY  ROW 


natural; 

natural; 

:  natural; 
:  natural 


port 

( 


go 

:  in  std  logic 

coeff  : 

in 

std  logic  : =  'O'; 

new  row  in  : 

in 

std  logic  : =  'O'; 

reset  : 

in 

std  logic  : =  'O'; 

elk 

:  in  std  logic 

--  16  bit  memory 

a 

:  in  std  logic 

b 

:  in  std  logic 

data  in  : 

in 

std  logic  vector 

:=  'O' 


data  out 


;or  ( (DATAJWIDTH-l)  downto  0); 
:or  ((DATA_WIDTH-i)  downto  0); 
DATA_WIDTH-1)  downto  0); 
out  std_logic  vector  ( (2*DATAJWIDTH-1)  downto  0) ; 


)  ; 

end  entity; 


addr_in 
addr  out 


out  std_logic_vector ( (ADDRESSJWIDTH-1)  downto  0 )  , 
out  std_logic_vector ( (ADDRESSJWIDTH-1)  downto  0) 


architecture  rtl  of  TemplateElement  is 
--  Build  the  state  machine 

type  state_types  is  (IDLE,  ACC,  STEP_DOWN,  STEP_DOWN_IDLE,  STEP_DOWN_LOAD) ; 

--  Registers  to  hold  the  current  state  and  the  next  state 
signal  present_state,  next_state  :  state_types  :=  IDLE; 

signal  col_counter,  prev_col,  step_col,  step_prev_col  :  std_logic_vector (6  downto 

0)  :=  std_logic_vector (to_unsigned (INITIALJCOL,  7) ) ; 

signal  row_counter,  prev_row,  step_row,  step_prev_row  :  std_logic_vector ( 0  downto 

0)  :=  std_logic_vector (to_unsigned (INITIAL_ROW  /  9,  1));  —  conv_std_logic_vector ( INITIAL_ROW  / 

9,  1) ; 


—  16  bit  memory 

signal  accumulation,  step_accum,  step_data_out  : 

std_logic_vector ( (2*DATA_WIDTH-1 )  downto  0) ; 

signal  go_bit,  go_done,  step_go_done,  load_data,  step_load_data  :  std_logic; 
signal  shift_bit,  shift_done,  step_shif t_done  :  std_logic; 

signal  step_read_write  :  std_logic_vector (0  downto  0) ; 

signal  unsigned_a  :  std_logic_vector ( (DATA_WIDTH)  downto  0) ; 

signal  col_shif t_counter,  step_col_shif t_counter  :  std_logic_vector (3  downto  0)  := 

"0000"; 

—  signal  row_shif t_counter ,  next_row_shif t_counter  :  std_logic_vector (4  downto  0)  := 

"0000"; 

signal  newcol ,  step_new_col  :  std_logic  :=  'O'; 

signal  down_bit,  newrow,  step_new_row  :  std_logic  :=  'O'; 


begin 


--  Force  positive  input  to  avoid  signed  operations 
unsigned_a (DATA_WIDTH)  <=  'O'; 
unsigned_a (DATA_WIDTH-1  downto  0)  <=  a; 

--  Determine  what  the  next  state  will  be,  and  set  the  output  bits 
process  (present_state,  newcol ,  newrow,  reset,  go_bit,  down_bit) 
begin 


case  present_state  is 
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when  IDLE  => 

if  (go_bit  =  '1')  then 

next_state  <=  ACC; 
elsif (down_bit  =  '1')  then 

next_state  <=  STEP_DOWN; 

else 

next_state  <=  IDLE; 

end  if; 


when  ACC  => 

next  state  <=  IDLE; 


when  STEP_DOWN  => 

next  state  <=  STEP  DOWN  IDLE; 


when  STEP_DOWN_IDLE  => 

next  state  <=  STEP  DOWN  LOAD; 


when  STEP_DOWN_LOAD  => 

next_state  <=  IDLE; 

end  case; 

end  process; 

--  Move  to  the  next  state 
process (elk,  reset) 
begin 

if  (reset  =  '1')  then 

present_state  <=  IDLE; 

col_counter  <=  std_logic_vector (to_unsigned (INITIAL_COL,  7) ) ; 
rowcounter  <=  std_logic_vector (to_unsigned (INITIAL_ROW  /  9,  1) ) ; 
accumulation  <=  "0000000000000000"; 

prev_col  <=  std_logic_vector (to_unsigned (INITIAL_COL,  7) ) ; 

prevrow  <=  std_logic_vector (to_unsigned (INITIAL_ROW  /  9,  1) ) ; 

go_done  <=  ' 0 1 ; 

read_write  <=  "0"; 

load_data  <=  ' 0 1 ; 

data_out  <=  "0000000000000000"; 

col_shif t_counter  <=  "0000"; 

newrow  <=  1 0 1 ; 

newcol  <=  ' 0 1 ; 

elsif (elk ' event  and  elk  =  '1')  then 
present_state  <=  next_state; 
col_counter  <=  step_col; 
rowcounter  <=  step_row; 
prev_col  <=  step_prev_col; 
prevrow  <=  step_prev_row; 
go_done  <=  step_go_done; 
read_write  <=  step_read_write; 
load_data  <=  step_load_data; 
data_out  <=  step_accum;  — step_data_out; 
accumulation  <=  step_accum; 

col_shif t_counter  <=  step_col_shif t_counter ; 
new  row  <=  step_new  row; 
newcol  <=  step_new_col ; 

end  if; 
end  process; 
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process (present_state,  data_in,  col_counter,  row_counter,  coeff ,  accumulation,  b, 
unsigned_a,  load_data,  step_accum,  prev_col,  prev_row,  go_done,  col_shif t_counter,  newrowin, 
new_col,  down_bit) 

variable  stepping_row  :  natural  range  0  to  8  :=  1; 

begin 

case  present_state  is 


when  IDLE  => 

step_accum  <=  accumulation; 
step_col  <=  col_counter; 
step_row  <=  rowcounter; 
step_prev_col  <=  prev_col; 
step_prev_row  <=  prevrow; 


no  accumulation  or  data  to  load 
keep  column  counter  constant 
keep  row  counter  constant 


--  16  bit  memory 
step_load_data  <=  load_data; 
step_read_write  <=  "0"; 


then 


if (col_shift_counter  =  std_logic_vector (to_unsigned (ARRAY_COL,  4))) 
step_new_col  <=  1 1 1 ; 


else 


step_new_col  <=  'O'; 


end  in¬ 


step  col  shift  counter  <=  col_shif t_counter ; 


when  ACC  => 

if (down_bit  =  '1')  then 

if (load_data  =  '1')  then 

step_accum  <=  data_in; 
step_load_data  <=  'O'; 

else 

step_accum  <=  accumulation; 


end  if; 

else 


if (load_data  =  '1')  then 

if (coeff  =  '1')  then 

step_accum  <=  data_in  -  b; 

else 

step_accum  <=  data_in  +  unsigned_a; 

end  if; 

step_load_data  <=  'O'; 


else 


if  (coeff  =  '1')  then 

step_accum  <=  accumulation  -  b; 

else 

step_accum  <=  accumulation  +  unsigned_a; 

end  if; 


end  if; 


end  if; 
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if (new_col  =  '!')  then 


step_col  <=  col_counter  +  9; 


step^row  <=  row^counter;  --  keep  row  counter  constant 

step_prev_col  <=  col_counter; 

step_prev_row  <=  row^counter; 

step_read_write  <=  "1"; 

step_load_data  <=  ' 1 1 ; 

if (col_shift_counter  =  "1000")  then 

step_col_shif t_counter  <=  "0000"; 

else 


step_col_shif t_counter  <=  col_shif t_counter  +  1; 

end  if; 


else 


end 


step_col  <=  col_counter; 

step_row  <=  row^counter;  --  keep  row  counter  constant 

step_prev_col  <=  prev_col; 

step_prev_row  <=  prevrow; 

step_read_write  <=  "0"; 

step_load_data  <=  'O'; 

if (col_shift_counter  =  "1000")  then 

step_col_shif t_counter  <=  "0000"; 

else 


step_col_shif t_counter  <=  col_shif t_counter  +  1; 

end  if; 


if; 


step_new_col  <=  'O'; 


when  STEP_DOWN  => 

step_accum  <=  accumulation; 

step_col  <=  std_logic_vector (to_unsigned ( INITIAL_COL, 
step_col 1  length) ) ;  --  return  column  counter  to  initial  position 

if (ARRAY_ROW  =  stepping^row)  then 

step_row  <=  row_counter  +1;  --  increment  row  counter 

else 

step_row  <=  row_counter; 

end  if; 

if (stepping_row  =  8)  then 
stepping_row  :=  0; 

else 

stepping_row  :=  stepping_row  +  1; 

end  if; 

step_prev__col  <=  col_counter; 
step_prev_row  <=  rowcounter; 
step_read_write  <=  "1"; 
step_load_data  <=  ' 1 ' ; 

step_col_shif t_counter  <=  "0000";  --  col_shif t_counter ; 
step_new_col  <=  'O'; 

when  STEP_DOWN_IDLE  => 

step_accum  <=  accumulation; 
step_load_data  <=  'O'; 
step_col_shift_counter  <=  col 
step_col  <=  col_counter; 
step_row  <=  rowcounter; 
step_prev_col  <=  prev_col; 
step_prev_row  <=  prevrow; 
step_read_write  <=  "0"; 
step_load_data  <=  'O'; 
step_new_col  <=  ' 0 1 ; 


shif t_counter ; 

--  keep  column  counter  constant 
--  keep  row  counter  constant 
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when  STEP_DOWN_LOAD  => 

step_accum  <=  data_in; 
step_load_data  <=  'O'; 

step_col_shif t_counter  <=  col_shif t_counter ; 

step__col  <=  col_counter;  --  keep  column  counter  constant 

step_row  <=  row_counter;  --  keep  row  counter  constant 

step_prev_col  <=  prev_col; 

step_prev_row  <=  prev_row; 

step_read_write  <=  "0"; 

step_load_data  <=  '  0 '  ; 

step_new_col  <=  'O'; 

end  case; 
end  process; 


--  Internal  Control  Process 

--  This  process  modifies  the  go  command  from  the  processor  so  that 
--  only  a  single  increment  can  occur  per  toggle  of  the  go  signal 
--  with  the  goal  of  eliminating  timing  issues  or  gliches  where 
--  no  accumulation  or  multiple  accumulations  of  the  same  data 
—  occurs . 


--  This  process  drives  the  following  signals: 

step_go_done :  this  sets  the  go_done  signal  to  the  value  of 

the  go  input  so  that  the  internal  go_bit  signal 
is  set  when  an  accumulation  has  not  yet  occured 
and  cleared  when  the  accumulation  state  is  reached 
once. 


process (present_state,  go,  go_done,  new^row,  newrowin) 
begin 

if (present_state  =  ACC)  then  --  Accumulation  state? 

step_go_done  <=  go;  —  Set  go_done  to  the  go  bit  on  next  clock  cycle 
step  new^row  <=  new  row; 
elsif (present_state  =  STEP_DOWN)  then 
step  newrow  <=  new  row  in; 
step_go_done  <=  go_done; 

else 

step  new^row  <=  new  row; 

step_go_done  <=  go_done;  --  Any  other  state  keeps  go_done  constant 

end  if; 
end  process; 

down  bit  <=  newrow  xor  new  row  in; 
go_bit  <=  go_done  xor  go; 
addr_in(7  downto  7)  <=  rowcounter; 
addr_in(6  downto  0)  <=  col_counter; 
addr_out(7  downto  7)  <=  prev^row; 
addr_out(6  downto  0)  <=  prev_col; 


end  rtl; 


library  ieee; 

use  ieee . std_logic_1164 . all; 
use  ieee . std_logic_arith . all; 
use  ieee . std_logic_signed. all; 

entity  FilterRowController  is 


generic 

( 


ROW  INIT  :  natural  :=  0 


port 

( 

--  Input  ports 
go  :  in  std_logic; 

elk  :  in  std_logic; 

--  Output  ports 
go_cmd  :  out  std_logic; 

oper  :  out  std_logic_vector ( 8  downto  0)  :=  "111000111"; 

new_col_out  :  out  std_logic_vector ( 8  downto  0) ; 

new_row_out  :  out  std_logic_vector ( 8  downto  0) ; 

col_sel  :  out  std_logic_vector (3  downto  0) 

)  ; 

end  FilterRowController; 


architecture  rtl  of  FilterRowController  is 


subtype  Array_Control  is  std_logic_vector ( 8  downto  0) ; 
subtype  Col_Sel_Control  is  std_logic_vector (3  downto  0)  ; 

type  Sof tware_Array_Type  is  array (9  downto  1)  of  Array_Control ; 
type  Col_Sel_Sof tware_Type  is  array (8  downto  0)  of  Col_Sel_Control 

signal  Oper_Control_LUT  :  Sof tware_Array_Type; 
signal  Col_Control_LUT  :  Sof tware_Array_Type; 
signal  Row_Control_LUT  :  Sof tware_Array_Type; 

--  signal  Col_Sel_Control_LUT  :  Col_Sel_Software_Type; 


begin 


— Oper_Control_LOT (0)  <=  "000000000" 
Oper_Control_LUT (1)  <=  "111000111"; 
Oper_Control_LUT (2)  <=  "110001111"; 
Oper_Control_LOT (3)  <=  "100011111"; 
Oper_Control_LOT (4)  <=  "000111111"; 
Oper_Control_LUT (5)  <=  "001111110"; 
Oper_Control_LOT ( 6)  <=  "011111100"; 
Oper_Control_LUT (7)  <=  "111111000"; 
Oper_Control_LOT (8)  <=  "111110001"; 
Oper_Control_LOT (9)  <=  "111100011"; 

--Row_Control_LUT (0)  <=  "000000000"; 
Row_Control_LUT ( 1 )  <=  "000000001"; 
Row_Control_LOT (2)  <=  "000000011"; 
Row_Control_LUT (3)  <=  "000000111"; 
Row_Control_LOT (4 )  <=  "000001111"; 
Row_Control_LUT (5)  <=  "000011111"; 
Row_Control_LUT (6)  <=  "000111111"; 
Row_Control_LUT (7)  <=  "001111111"; 
Row_Control_LUT ( 8 )  <=  "011111111"; 
Row_Control_LOT ( 9)  <=  "111111111"; 


--Col_Control_LUT (0)  <=  "000000000" 
Col_Control_LUT (1)  <=  "000000001"; 
Col_Control_LUT (2)  <=  "000000010"; 
Col_Control_LUT (3)  <=  "000000100"; 
Col_Control_LUT (4)  <=  "000001000"; 
Col_Control_LOT (5)  <=  "000010000"; 
Col_Control_LUT (6)  <=  "000100000"; 
Col_Control_LUT (7 )  <=  "001000000"; 
Col  Control  LUT(8)  <=  "010000000"; 


Col_Control_LUT (9)  <=  "100000000"; 


process (elk,  go) 

variable  step_counter  :  integer  range  1  to  9  :=  1; 
variable  step_down__counter  :  integer  range  0  to  127  :=  0; 
variable  row_step_counter :  integer  range  0  to  9  :=  0; 

variable  col_sel_counter  :  integer  range  0  to  8  :=  0; 

variable  three_clk  :  integer  range  0  to  2  :=  0; 

variable  two_clk  :  integer  range  0  to  1  :=  1; 

variable  go_toggle  :  std_logic  :=  'O'; 

begin 


iffclk'event  and  elk  =  '1')  then 

if  (step_down_counter  =  127)  then 

oper  <=  Oper_Control_LUT (1) ; 

new_col_out  <=  Col_Control_LUT ( 1 ) ; 

new_row_out  <=  Row_Control_LUT (row_step_counter) ; 

col_sel  <=  conv_std_logic_vector (row_step_counter ,  4); 

if (three_clk  =  2)  then 

if (row_step_counter  =  9)  then 
step_down_counter  :=  0; 
row_step_counter  :=  9; 
step_counter  :=  1; 
col_sel_counter  :=  0; 

else 

row_step_counter  :=  row_step_counter  +  1 
three_clk  :=  0; 

end  if; 

else 

three  elk  :=  three  elk  t  1; 


end  if; 


else 

oper  <=  Oper_Control_LUT (step_counter) ; 
new_col_out  <=  Col_Control_LUT ( 1 ) ; 

col_sel  <=  conv_std_logic_vector (col_sel_counter,  4); 
new_row_out  <=  Row_Control_LOT (row_step_counter ) ; 

if (go  =  ' 1 ' )  then 

go_toggle  :=  not (go_toggle) ; 
go_cmd  <=  go_toggle; 


step_down_counter  :=  step_down_counter  t  1; 

if  (step_counter  =  9)  then 
step_counter  :=  1; 

else 

step_counter  :=  step_counter  t  1; 

end  if; 

if (col_sel_counter  =  9)  then 
col_sel_counter  :=  1; 

else 

col_sel_counter  :=  col_sel_counter  t  1; 

end  if; 

two  elk  :=  0; 


else 


go_cmd  <=  go_toggle; 


end  if; 


end  if; 
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end  if; 
end  process; 
end  rtl; 


